Basketball's Hidden Playbook: Deciphering the Game with Data¶

Spring 2024 Data Science Project¶

Collaborators: Adiah Amadou, Ryan Kemajou

Group Members and Contributions¶

Contribution Checkpoints¶

A: Project idea - 5%
B: Dataset Curation and Preprocessing - 10%
C: Data Exploration and Summary Statistics - 10%
D: ML Algorithm Design/Development - 25%
E: ML Algorithm Training and Test Data Analysis - 20%
F: Visualization, Result Analysis, Conclusion - 15%
G: Final Tutorial Report Creation - 10%
H: Additional (not listed above, if any) - 5%

Member 1: Adiah Amadou, Contribution: 100%.

Member 2: Ryan Kemajou, Contribution: 100%.

Detailed contribution¶

Adiah Amadou: Throughout the project, my contributions spanned various crucial aspects. Firstly, I collaborated on crafting the introduction, ensuring clarity and relevance to set the stage for our work. Subsequently, I delved into data cleaning and exploratory analysis, employing techniques such as linear regression and mean mode median calculations to derive insights. My involvement extended to the meticulous formatting, layout, and typing of the final project, ensuring its presentation was professional and cohesive. Additionally, I played a pivotal role in refining the machine learning model, troubleshooting bugs, and incorporating features to enhance accuracy. Prior to implementation, I participated in researching the project idea and curated valuable sources to underpin our work. Furthermore, my contributions extended beyond technical tasks; I facilitated insightful conclusions and insights at each juncture, leveraging my understanding of the project's nuances. Moreover, I actively engaged in fostering a conducive workflow by posing pertinent questions aimed at optimizing outcomes. Overall, my multifaceted involvement contributed significantly to the project's success, amalgamating technical proficiency with strategic thinking.

Ryan Kemajou: In the recent data science project, I assisted in every stage of the project lifecycle. I contributed to the identification of the project theme, ensuring it was relevant and impactful. I conducted extensive research to locate valuable datasets that would serve as the backbone of our analysis. I meticulously cleaned a portion of the data before integrating it into our shared OneDrive. Once the datasets were distributed among the team, I further cleaned my assigned portion, preparing it for analysis. I proposed insightful questions for analysis post data cleaning and suggested effective ways to visualize our findings. I joined in brainstorming sessions to determine the most suitable statistical tests for our project. I assisted in exploring the data, uncovering patterns and insights that could guide our project. I actively participated in discussions on selecting the most appropriate model for our project and provided supplementary materials such as articles and youtube videos to enhance the team's understanding of the task more specifically the use of random forest classification. Lastly, I helped ensure our final tutorial was well-documented with clear and concise comments, enhancing its readability and understandability.

I- Introduction¶

In the competitive world of professional basketball, the ability to predict future outcomes isn't just about entertainment; it's about gaining a strategic advantage. Imagine coaches being able to forecast which NBA team(s) will dominate the next season. With such insights, coaches can fine-tune their training regimes, refine game strategies, and make informed decisions about player acquisitions. Players themselves can set more targeted performance goals, aiming for the stats that truly matter. Fans and bettors, too, can benefit greatly from such predictions, knowing which teams to support and why.

Harnessing the power of advanced analytics and historical data, we aim to develop predictive models that not only forecast the outcomes of the NBA season but also empower coaches, players, fans, and bettors alike. Through comprehensive analysis of diverse datasets encompassing team and player statistics, awards, opponent performance, and injury reports, we seek to uncover hidden patterns and insights that illuminate the path to success in the NBA.

II- Data Collection¶

For this project, datasets chosen englobe NBA data from 1947 to present. These datasets were obtained from Kaggle, and Basketball-Reference.com, reliable websites for basketballdata

  • 2023 NBA Playoffs Summary | Basketball-Reference.com
  • NBA Database (kaggle.com)
  • NBA games data (kaggle.com)
  • Data.org NBA Playoffs Datasets
  • NBA Stats (1947-present) (kaggle.com)
  • NBA Injuries from 2010-2020 (kaggle.com)

Our datasets contains more than 30,000 rows and over 30 columns with statistics encompassing the following;

  1. Team Season Stats: Detailed statistics capturing team performance metrics such as points scored, rebounds, assists, field goal percentages, and defensive statistics.
  2. Player Season Stats: Individual player statistics including points per game, rebounds, assists, steals, blocks, shooting percentages, and other relevant metrics.

  3. Awards (Players) Stats: Historical data on awards won by players, such as MVP awards, Defensive Player of the Year, Rookie of the Year, and other accolades.

  4. Opponent Team Stats: Statistics related to opponents faced by each team, including their performance metrics, strength of schedule, and head-to-head matchups.

  5. Player Injuries: Information on player injuries and their impact on team performance throughout the season.

By assembling these datasets from trusted sources, we ensure the reliability and accuracy of the data used in our predictive models. This comprehensive approach allows us to analyze a wide range of factors that contribute to NBA team performance, enabling us to develop robust and insightful predictions for the upcoming season.

We decided to keep all of the files (mainly CSVs) relevant to this project in a drive so we mounted here.

In [4]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

It's essential to bring in certain libraries so below are the libraries that will accompany us on our journey through this project.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
import scipy.stats as stats
import statsmodels.api as sm
import math

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

III - Data cleaning & Insights

ready to scrub man

We opted for a step-by-step approach in handling our dataframes. Below, we began by examining the CSV files related to players from each NBA team. Additionally, we consolidated the dataframes into a single one, named "player_df."

In [6]:
df_totals = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Totals.csv')
df_shooting = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Shooting.csv')
df_play_by_play = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Play By Play.csv')
df_per_game = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Player Per Game.csv')
df_per_100_poss = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Per 100 Poss.csv')
df_per_36_minutes = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Per 36 Minutes.csv')


m1 = pd.merge(df_totals, df_shooting, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE',  'LEAGUE', 'TEAM_ABBREVIATION', 'MINUTES_PLAYED'], how ='left')
m2 = pd.merge(m1, df_play_by_play, on =['SEASON_ID', 'SEASON',  'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE','GAMES', 'MINUTES_PLAYED'], how ='left')
m3 = pd.merge(m2, df_per_game, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES',  'FREE_THROW_PERCENTAGE',  'FIELD_GOAL_PERCENT'], how ='left')
m4 = pd.merge(m3, df_per_100_poss, on =['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'AGE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE'], how ='left')
player_df = pd.merge(m4, df_per_36_minutes, on =['SEASON_ID','POSITION', 'AGE', 'LEAGUE',  'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE', 'TEAM'], how ='left')

player_df.describe()
Out[6]:
SEASON_ID SEASON PLAYER_ID BIRTH_YEAR AGE EXPERIENCE GAMES GAMES_STARTED MINUTES_PLAYED FIELD_GOALS_MADE ... FREETHROW_ATTEMPTED_PER_36_MIN OFFENSIVE_REBOUND_PER_36_MIN DEFENSIVE_REBOUND_PER_36_MIN TOTAL_REBOUND_PER_36_MIN ASSIST_PER_36_MIN STEALS_PER_36_MIN BLOCKS_PER_36_MIN TURNOVER_PER_36_MIN PERSONAL_FOULS_PER_36_MIN POINTS_PER_36_MIN
count 31787.00000 31787.000000 31787.000000 2870.000000 31765.000000 31787.000000 31787.000000 23150.000000 30704.000000 31787.000000 ... 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000 16697.000000
mean 15894.00000 1994.973008 2646.740114 1944.139721 26.486164 4.967691 48.607796 22.406177 1158.959386 187.590965 ... 3.182793 1.779745 4.587794 6.367036 2.965131 1.124837 0.730017 2.064868 3.598341 13.326556
std 9176.26084 20.403059 1301.138515 15.162153 3.844564 3.723561 26.616961 27.814887 926.420014 184.565391 ... 2.184569 1.633455 2.274799 3.304136 2.144295 0.776187 0.872100 1.305920 1.894635 5.179310
min 1.00000 1947.000000 1.000000 1914.000000 18.000000 1.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 7947.50000 1979.000000 1625.000000 1938.000000 24.000000 2.000000 25.000000 0.000000 308.000000 37.000000 ... 1.800000 0.700000 3.100000 3.900000 1.500000 0.700000 0.200000 1.400000 2.500000 10.200000
50% 15894.00000 1998.000000 2733.000000 1946.000000 26.000000 4.000000 53.000000 7.000000 986.000000 132.000000 ... 2.900000 1.300000 4.200000 5.700000 2.300000 1.000000 0.500000 1.900000 3.300000 13.100000
75% 23840.50000 2012.000000 3718.000000 1951.000000 29.000000 7.000000 74.000000 42.000000 1882.000000 286.000000 ... 4.200000 2.600000 5.800000 8.400000 4.000000 1.400000 1.000000 2.500000 4.300000 16.200000
max 31787.00000 2024.000000 5197.000000 2002.000000 46.000000 22.000000 90.000000 83.000000 3882.000000 1597.000000 ... 36.000000 36.000000 72.000000 72.000000 36.000000 18.000000 36.000000 36.000000 72.000000 108.000000

8 rows × 126 columns

In [7]:
player_df.columns.tolist()
Out[7]:
['SEASON_ID',
 'SEASON',
 'PLAYER_ID',
 'PLAYER',
 'BIRTH_YEAR',
 'POSITION',
 'AGE',
 'EXPERIENCE',
 'LEAGUE',
 'TEAM_ABBREVIATION',
 'GAMES',
 'GAMES_STARTED',
 'MINUTES_PLAYED',
 'FIELD_GOALS_MADE',
 'FIELD_GOALS_AGAINST',
 'FIELD_GOAL_PERCENTAGE_x',
 'THREE_POINTERS_MADE',
 'THREE_POINTERS_ATTEMPTED',
 'THREE_POINTERS_PERCENTAGE',
 'TWO_POINTERS_MADE',
 'TWO_POINTERS_ATTEMPTED',
 'TWO_POINTERS_PERCENTAGE',
 'EFFECTIVE_FIELD_GOAL_PERCENTAGE',
 'FREE_THROWS_MADE',
 'FREE_THROW_ATTEMPTED',
 'FREE_THROW_PERCENTAGE',
 'OFFENSIVE_REBOUND',
 'DEFENSIVE_REBOUND',
 'TOTAL_REBOUND',
 'ASSISTS',
 'STEALS',
 'BLOCK',
 'TURNOVER',
 'PERSONAL_FOUL',
 'TOTAL_POINTS',
 'YEARS_OF_EXPERIENCE',
 'GAMES_PLAYED',
 'FIELD_GOAL_PERCENT',
 'AVERAGE_DISTANCE_OF_FIELD_GOALS_ATTEMPTED',
 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_TWO_POINT_RANGE',
 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_THREE_POINT_RANGE',
 'FIELD_GOAL_PERCENTAGE_FROM_TWO_POINT_RANGE',
 'FIELD_GOAL_PERCENTAGE_FROM_THREE_POINT_RANGE',
 'PERCENT_ASSISTED_TWO_POINT_FIELD_GOAL',
 'PERCENT_ASSISTED_THREE_POINT_FIELD_GOAL',
 'PERCENTAGE_DUNKS_OF_FIELD_GOAL_ATTEMPTED',
 'NUMBER_OF_DUNKS',
 'PERCENTAGE_CORNER_THREE_OF_THREE_POINTER_ATTEMPTED',
 'CORNER_THREE_POINT_PERCENT',
 'NUMBER_HEAVES_ATTEMPTED',
 'NUMBER_HEAVES_MADE',
 'PLAYED_ID',
 'TEAM',
 'POINTS_PER_GAME_PERCENT',
 'SHOOTING_GUARD_PERCENT',
 'SMALL_FORWARD_PERCENT',
 'POWER_FORWARD_PERCENT',
 'CENTER_PERCENT',
 'ON_COURT_PLUS_MINUES_PER_100_POSSESION',
 'NET_PLUS_MINUES_PER_100_POSSESION',
 'BAD_PASS_TURNOVER',
 'LOST_BALL_TURNOVER',
 'SHOOTING_FOUL_COMMITTED',
 'OFFENSIVE_FOUL_COMMITTED',
 'SHOOTING_FOUL_DRAWN',
 'OFFENSIVE_FOUL_DRAWN',
 'POINTS_GENERATED_BY_ASSISTS',
 'AND1S',
 'FIELD_GOAL_ATTEMPTED_BLOCKED',
 'gs',
 'MINUTES_PLAYED_PER_GAME',
 'FIELD_GOAL_PER_GAME',
 'FIELD_GOAL_ATTEMPTED_PER_GAME',
 'THREE_POINTERS_PER_GAME',
 'THREE_POINT_ATTEMPTED_PER_GAME',
 'THREE_POINT_PERCENTAGE',
 'TWO_POINTER_PER_GAME',
 'TWO_POINTER_ATTEMPTED_PER_GAME',
 'TWO_POINT_PERCENTAGE',
 'e_fg_percent',
 'FREE_THROW_PER_GAME',
 'FREE_THROW_ATTEMPTED_PER_GAME',
 'OFFENSIVE_REBOUND_PER_GAME',
 'DEFENSIVE_REBOUND_PER_GAME',
 'TOTAL_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'STEAL_PER_GAME',
 'BLOCKS_PER_GAME',
 'TURNOVERS_PER_GAME',
 'PERSONAL_FOUL_PER_GAME',
 'POINTS_PER_GAME',
 'POSITINON',
 'FIELD_GOALS_PER_100_POSSESION',
 'FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
 'FIELD_GOALS_PERCENTAGE',
 'THREE_POINTERS_PER_100_POSSESION',
 'THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'TWO_POINTERS_PER_100_POSSESION',
 'TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'FREE_THROWS_PER_100_POSSESION',
 'FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
 'OFFENSIVE_REBOUND_PER_100_POSSESION',
 'DEFENSIVE_REBOUND_PER_100_POSSESION',
 'TOTAL_REBOUNDS_PER_100_POSSESION',
 'ASSIST_PER_100_POSSESION',
 'STEALS_PER_100_POSSESION',
 'BLOCKS_PER_100_POSSESION',
 'TURNOVER_PER_100_POSSESION',
 'PERSONAL_FOULS_PER_100_POSSESION',
 'POINTS_PER_100_POSSESION',
 'OFFESNIVE_Rating',
 'DEFENSIVE RATING',
 'SEAONS',
 'PLAYR_ID',
 'PLAYER_NAME',
 'FIELD_GOALS_PER_36_MINUTES',
 'FIELD_GOALS_ATTEMPTED_PER_GAME',
 'FIELD_GOAL_PERCENTAGE_y',
 'THREE_POINTERS_PER_36_MIN',
 'THREE_POINTERS_ATTEMPTED_PER_36_MIN',
 'TWO_POINTERS_PER_36_MIN',
 'TWO_POINTERS_ATTEMPTED_PER_36_MIN',
 'FREETHROW_PER_36_MIN',
 'FREETHROW_ATTEMPTED_PER_36_MIN',
 'OFFENSIVE_REBOUND_PER_36_MIN',
 'DEFENSIVE_REBOUND_PER_36_MIN',
 'TOTAL_REBOUND_PER_36_MIN',
 'ASSIST_PER_36_MIN',
 'STEALS_PER_36_MIN',
 'BLOCKS_PER_36_MIN',
 'TURNOVER_PER_36_MIN',
 'PERSONAL_FOULS_PER_36_MIN',
 'POINTS_PER_36_MIN']

This function will help us identify any columns that are duplicated in our unified player dataframe

In [8]:
def common_columns(dataframes):
    # Create a dictionary to count the occurrences of each column
    column_counts = {}

    # Iterate over all dataframes
    for df in dataframes:
        # Iterate over all columns in the current dataframe
        for col in df.columns:
            # If the column is already in the dictionary, increment its count
            if col in column_counts:
                column_counts[col] += 1
            # Otherwise, add the column to the dictionary with a count of 1
            else:
                column_counts[col] = 1

    # Return a list of columns that appear at least twice
    return [col for col, count in column_counts.items() if count >= 2]
In [9]:
dfs=[df_totals,
df_shooting ,
df_play_by_play,
df_per_game,
df_per_100_poss,
df_per_36_minutes]
# dfs is your list of dataframes
common_cols = common_columns(dfs)
print(common_cols)
['SEASON_ID', 'SEASON', 'PLAYER_ID', 'PLAYER', 'POSITION', 'AGE', 'EXPERIENCE', 'LEAGUE', 'TEAM_ABBREVIATION', 'GAMES', 'GAMES_STARTED', 'MINUTES_PLAYED', 'FIELD_GOAL_PERCENTAGE', 'THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROW_PERCENTAGE', 'YEARS_OF_EXPERIENCE', 'FIELD_GOAL_PERCENT', 'TEAM']
In [10]:
player_df =player_df.drop([
'PLAYER_NAME','POSITINON','PLAYED_ID','GAMES_PLAYED', 'SEASON_ID',
 'BIRTH_YEAR',
 'POSITION',
 'LEAGUE',
 'THREE_POINTERS_ATTEMPTED',
 'TWO_POINTERS_ATTEMPTED',
 'YEARS_OF_EXPERIENCE',
 'GAMES_PLAYED',
 'FIELD_GOAL_PERCENT',
 'AVERAGE_DISTANCE_OF_FIELD_GOALS_ATTEMPTED',
 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_TWO_POINT_RANGE',
 'PERCENTAGE_FIELD_GOAL_ATTEMPTED_FROM_THREE_POINT_RANGE',
 'FIELD_GOAL_PERCENTAGE_FROM_TWO_POINT_RANGE',
 'FIELD_GOAL_PERCENTAGE_FROM_THREE_POINT_RANGE',
 'PERCENT_ASSISTED_TWO_POINT_FIELD_GOAL',
 'PERCENT_ASSISTED_THREE_POINT_FIELD_GOAL',
 'PERCENTAGE_DUNKS_OF_FIELD_GOAL_ATTEMPTED',
 'NUMBER_OF_DUNKS',
 'PERCENTAGE_CORNER_THREE_OF_THREE_POINTER_ATTEMPTED',
 'CORNER_THREE_POINT_PERCENT',
 'NUMBER_HEAVES_ATTEMPTED',
 'NUMBER_HEAVES_MADE',
 'PLAYED_ID',
 'POINTS_PER_GAME_PERCENT',
 'SHOOTING_GUARD_PERCENT',
 'SMALL_FORWARD_PERCENT',
 'POWER_FORWARD_PERCENT',
 'CENTER_PERCENT',
 'ON_COURT_PLUS_MINUES_PER_100_POSSESION',
 'NET_PLUS_MINUES_PER_100_POSSESION',
 'BAD_PASS_TURNOVER',
 'LOST_BALL_TURNOVER',
 'SHOOTING_FOUL_COMMITTED',
 'OFFENSIVE_FOUL_COMMITTED',
 'SHOOTING_FOUL_DRAWN',
 'OFFENSIVE_FOUL_DRAWN',
 'POINTS_GENERATED_BY_ASSISTS',
 'AND1S',
 'FIELD_GOAL_ATTEMPTED_BLOCKED',
 'gs',
 'MINUTES_PLAYED_PER_GAME',
 'FIELD_GOAL_PER_GAME',
 'FIELD_GOAL_ATTEMPTED_PER_GAME',
 'THREE_POINTERS_PER_GAME',
 'THREE_POINT_ATTEMPTED_PER_GAME',
 'THREE_POINT_PERCENTAGE',
 'TWO_POINTER_PER_GAME',
 'TWO_POINTER_ATTEMPTED_PER_GAME',
 'TWO_POINT_PERCENTAGE',
 'e_fg_percent',
 'FREE_THROW_PER_GAME',
 'FREE_THROW_ATTEMPTED_PER_GAME',
 'OFFENSIVE_REBOUND_PER_GAME',
 'DEFENSIVE_REBOUND_PER_GAME',
 'TOTAL_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'STEAL_PER_GAME',
 'BLOCKS_PER_GAME',
 'TURNOVERS_PER_GAME',
 'PERSONAL_FOUL_PER_GAME',
 'POINTS_PER_GAME',
 'POSITINON',
 'FIELD_GOALS_PER_100_POSSESION',
 'FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
 'FIELD_GOALS_PERCENTAGE',
 'THREE_POINTERS_PER_100_POSSESION',
 'THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'TWO_POINTERS_PER_100_POSSESION',
 'TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'FREE_THROWS_PER_100_POSSESION',
 'FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
 'OFFENSIVE_REBOUND_PER_100_POSSESION',
 'DEFENSIVE_REBOUND_PER_100_POSSESION',
 'TOTAL_REBOUNDS_PER_100_POSSESION',
 'ASSIST_PER_100_POSSESION',
 'STEALS_PER_100_POSSESION',
 'BLOCKS_PER_100_POSSESION',
 'TURNOVER_PER_100_POSSESION',
 'PERSONAL_FOULS_PER_100_POSSESION',
 'POINTS_PER_100_POSSESION',
 'SEAONS',
 'PLAYR_ID',
 'PLAYER_NAME',
 'FIELD_GOALS_PER_36_MINUTES',
 'FIELD_GOALS_ATTEMPTED_PER_GAME',
 'FIELD_GOAL_PERCENTAGE_y',
 'THREE_POINTERS_PER_36_MIN',
 'THREE_POINTERS_ATTEMPTED_PER_36_MIN',
 'TWO_POINTERS_PER_36_MIN',
 'TWO_POINTERS_ATTEMPTED_PER_36_MIN',
 'FREETHROW_PER_36_MIN',
 'FREETHROW_ATTEMPTED_PER_36_MIN',
 'OFFENSIVE_REBOUND_PER_36_MIN',
 'DEFENSIVE_REBOUND_PER_36_MIN',
 'TOTAL_REBOUND_PER_36_MIN',
 'ASSIST_PER_36_MIN',
 'STEALS_PER_36_MIN',
 'BLOCKS_PER_36_MIN',
 'TURNOVER_PER_36_MIN',
 'PERSONAL_FOULS_PER_36_MIN',
 'POINTS_PER_36_MIN'], axis =1)
In [11]:
player_df.describe()
Out[11]:
SEASON PLAYER_ID AGE EXPERIENCE GAMES GAMES_STARTED MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_AGAINST FIELD_GOAL_PERCENTAGE_x ... DEFENSIVE_REBOUND TOTAL_REBOUND ASSISTS STEALS BLOCK TURNOVER PERSONAL_FOUL TOTAL_POINTS OFFESNIVE_Rating DEFENSIVE RATING
count 31787.000000 31787.000000 31765.000000 31787.000000 31787.000000 23150.000000 30704.000000 31787.000000 31787.000000 31628.000000 ... 27130.000000 30893.000000 31787.000000 26161.000000 26162.000000 26152.000000 31787.000000 31787.000000 16635.000000 16697.000000
mean 1994.973008 2646.740114 26.486164 4.967691 48.607796 22.406177 1158.959386 187.590965 414.794035 0.429751 ... 144.706377 216.707345 109.247711 37.799014 23.282547 70.743691 109.857961 492.135433 103.039615 107.991316
std 20.403059 1301.138515 3.844564 3.723561 26.616961 27.814887 926.420014 184.565391 391.133337 0.101668 ... 146.348383 224.381109 130.954467 37.430130 35.281323 67.437942 83.664565 485.470475 18.632958 5.670590
min 1947.000000 1.000000 18.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 33.000000
25% 1979.000000 1625.000000 24.000000 2.000000 25.000000 0.000000 308.000000 37.000000 90.000000 0.390000 ... 32.000000 48.000000 17.000000 8.000000 3.000000 16.000000 34.000000 98.000000 97.000000 104.000000
50% 1998.000000 2733.000000 26.000000 4.000000 53.000000 7.000000 986.000000 132.000000 302.000000 0.438000 ... 103.000000 152.000000 64.000000 27.000000 11.000000 51.000000 98.000000 343.000000 105.000000 108.000000
75% 2012.000000 3718.000000 29.000000 7.000000 74.000000 42.000000 1882.000000 286.000000 637.000000 0.481000 ... 209.000000 308.000000 152.000000 56.000000 28.000000 107.000000 173.000000 750.000000 112.000000 112.000000
max 2024.000000 5197.000000 46.000000 22.000000 90.000000 83.000000 3882.000000 1597.000000 3159.000000 1.000000 ... 1111.000000 2149.000000 1164.000000 346.000000 456.000000 464.000000 386.000000 4029.000000 300.000000 130.000000

8 rows × 29 columns

In [12]:
player_df.shape
Out[12]:
(31787, 32)
In [13]:
player_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31787 entries, 0 to 31786
Data columns (total 32 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   SEASON                           31787 non-null  int64  
 1   PLAYER_ID                        31787 non-null  int64  
 2   PLAYER                           31787 non-null  object 
 3   AGE                              31765 non-null  float64
 4   EXPERIENCE                       31787 non-null  int64  
 5   TEAM_ABBREVIATION                31787 non-null  object 
 6   GAMES                            31787 non-null  int64  
 7   GAMES_STARTED                    23150 non-null  float64
 8   MINUTES_PLAYED                   30704 non-null  float64
 9   FIELD_GOALS_MADE                 31787 non-null  int64  
 10  FIELD_GOALS_AGAINST              31787 non-null  int64  
 11  FIELD_GOAL_PERCENTAGE_x          31628 non-null  float64
 12  THREE_POINTERS_MADE              25435 non-null  float64
 13  THREE_POINTERS_PERCENTAGE        21250 non-null  float64
 14  TWO_POINTERS_MADE                31787 non-null  int64  
 15  TWO_POINTERS_PERCENTAGE          31540 non-null  float64
 16  EFFECTIVE_FIELD_GOAL_PERCENTAGE  31628 non-null  float64
 17  FREE_THROWS_MADE                 31787 non-null  int64  
 18  FREE_THROW_ATTEMPTED             31787 non-null  int64  
 19  FREE_THROW_PERCENTAGE            30488 non-null  float64
 20  OFFENSIVE_REBOUND                27130 non-null  float64
 21  DEFENSIVE_REBOUND                27130 non-null  float64
 22  TOTAL_REBOUND                    30893 non-null  float64
 23  ASSISTS                          31787 non-null  int64  
 24  STEALS                           26161 non-null  float64
 25  BLOCK                            26162 non-null  float64
 26  TURNOVER                         26152 non-null  float64
 27  PERSONAL_FOUL                    31787 non-null  int64  
 28  TOTAL_POINTS                     31787 non-null  int64  
 29  TEAM                             16702 non-null  object 
 30  OFFESNIVE_Rating                 16635 non-null  float64
 31  DEFENSIVE RATING                 16697 non-null  float64
dtypes: float64(17), int64(12), object(3)
memory usage: 7.8+ MB
In [14]:
player_df.head()
Out[14]:
SEASON PLAYER_ID PLAYER AGE EXPERIENCE TEAM_ABBREVIATION GAMES GAMES_STARTED MINUTES_PLAYED FIELD_GOALS_MADE ... TOTAL_REBOUND ASSISTS STEALS BLOCK TURNOVER PERSONAL_FOUL TOTAL_POINTS TEAM OFFESNIVE_Rating DEFENSIVE RATING
0 2024 5025 A.J. Green 24.0 2 MIL 36 0.0 335.0 52 ... 38.0 21 3.0 2.0 4.0 34 156 MIL 128.0 121.0
1 2024 5026 A.J. Lawson 23.0 2 DAL 27 0.0 230.0 39 ... 32.0 13 9.0 3.0 10.0 19 103 DAL 110.0 117.0
2 2024 5027 AJ Griffin 20.0 2 ATL 18 0.0 132.0 13 ... 14.0 4 1.0 1.0 6.0 6 37 ATL 82.0 126.0
3 2024 4219 Aaron Gordon 28.0 10 DEN 49 49.0 1555.0 266 ... 327.0 150 37.0 34.0 68.0 90 677 DEN 123.0 116.0
4 2024 4582 Aaron Holiday 27.0 6 HOU 51 1.0 913.0 134 ... 95.0 96 28.0 4.0 40.0 83 370 HOU 118.0 115.0

5 rows × 32 columns

In [15]:
player_df.isnull().sum().tolist()
Out[15]:
[0,
 0,
 0,
 22,
 0,
 0,
 0,
 8637,
 1083,
 0,
 0,
 159,
 6352,
 10537,
 0,
 247,
 159,
 0,
 0,
 1299,
 4657,
 4657,
 894,
 0,
 5626,
 5625,
 5635,
 0,
 0,
 15085,
 15152,
 15090]
In [16]:
player_df['SEASON'].nunique()
Out[16]:
78
In [17]:
player_df['PLAYER_ID'].nunique()
Out[17]:
5197
In [18]:
player_df = player_df.rename(columns={'FIELD_GOAL_PERCENTAGE_x': 'FIELD_GOAL_PERCENTAGE'})
player_df = player_df.rename(columns={'FIELD_GOALS_AGAINST': 'FIELD_GOAL_ATTEMPTED'})
player_df = player_df.rename(columns={'OFFESNIVE_Rating': 'OFFENSIVE_RATING'})
In [19]:
player_df.describe()
Out[19]:
SEASON PLAYER_ID AGE EXPERIENCE GAMES GAMES_STARTED MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOAL_ATTEMPTED FIELD_GOAL_PERCENTAGE ... DEFENSIVE_REBOUND TOTAL_REBOUND ASSISTS STEALS BLOCK TURNOVER PERSONAL_FOUL TOTAL_POINTS OFFENSIVE_RATING DEFENSIVE RATING
count 31787.000000 31787.000000 31765.000000 31787.000000 31787.000000 23150.000000 30704.000000 31787.000000 31787.000000 31628.000000 ... 27130.000000 30893.000000 31787.000000 26161.000000 26162.000000 26152.000000 31787.000000 31787.000000 16635.000000 16697.000000
mean 1994.973008 2646.740114 26.486164 4.967691 48.607796 22.406177 1158.959386 187.590965 414.794035 0.429751 ... 144.706377 216.707345 109.247711 37.799014 23.282547 70.743691 109.857961 492.135433 103.039615 107.991316
std 20.403059 1301.138515 3.844564 3.723561 26.616961 27.814887 926.420014 184.565391 391.133337 0.101668 ... 146.348383 224.381109 130.954467 37.430130 35.281323 67.437942 83.664565 485.470475 18.632958 5.670590
min 1947.000000 1.000000 18.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 33.000000
25% 1979.000000 1625.000000 24.000000 2.000000 25.000000 0.000000 308.000000 37.000000 90.000000 0.390000 ... 32.000000 48.000000 17.000000 8.000000 3.000000 16.000000 34.000000 98.000000 97.000000 104.000000
50% 1998.000000 2733.000000 26.000000 4.000000 53.000000 7.000000 986.000000 132.000000 302.000000 0.438000 ... 103.000000 152.000000 64.000000 27.000000 11.000000 51.000000 98.000000 343.000000 105.000000 108.000000
75% 2012.000000 3718.000000 29.000000 7.000000 74.000000 42.000000 1882.000000 286.000000 637.000000 0.481000 ... 209.000000 308.000000 152.000000 56.000000 28.000000 107.000000 173.000000 750.000000 112.000000 112.000000
max 2024.000000 5197.000000 46.000000 22.000000 90.000000 83.000000 3882.000000 1597.000000 3159.000000 1.000000 ... 1111.000000 2149.000000 1164.000000 346.000000 456.000000 464.000000 386.000000 4029.000000 300.000000 130.000000

8 rows × 29 columns

Given that some of data from different columns was absent, we decided to fill them with teh mean. Using just teh mean of the whole dataframe would not be fit as we would in that case, be assigning the mean of players from different teams to players of other teams. So we decided to assign use the mean of other players within the same team and teh same year.

In [20]:
cols_to_fill = ['THREE_POINTERS_MADE','THREE_POINTERS_PERCENTAGE','STEALS','BLOCK','TURNOVER','DEFENSIVE_REBOUND','OFFENSIVE_REBOUND']
for col in cols_to_fill:
    player_df[col] = player_df.groupby(['TEAM_ABBREVIATION', 'SEASON'])[col].transform(lambda x: x.fillna(x.mean()))
In [20]:

In [21]:
player_df['AGE'].hist()
Out[21]:
<Axes: >

How many teams do we even have?

In [22]:
player_df['TEAM'].unique().tolist()
Out[22]:
['MIL',
 'DAL',
 'ATL',
 'DEN',
 'HOU',
 'IND',
 'OKC',
 'CHI',
 'ORL',
 'BOS',
 'TOT',
 'DET',
 'NYK',
 'LAL',
 'SAC',
 'MIA',
 'CHO',
 'LAC',
 'GSW',
 'POR',
 'MIN',
 'WAS',
 'BRK',
 'MEM',
 'SAS',
 'PHO',
 'NOP',
 'UTA',
 'TOR',
 'PHI',
 'CLE',
 'CHA',
 'NOH',
 'NJN',
 'SEA',
 'NOK',
 'CHH',
 'VAN',
 'WSB',
 nan]

There's a rumour going round that younger players perform better. We just wanted to test it...

In [23]:
correlation = player_df['AGE'].corr(player_df['FIELD_GOALS_MADE'])
print(correlation)
0.027280815763253866
In [24]:
plt.figure(figsize=(10,8))
sns.heatmap(player_df[['AGE', 'FIELD_GOALS_MADE']].corr(), annot=True, cmap='coolwarm')
plt.show()
In [25]:
df_opponent_stats_100=pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Stats Per 100 Poss.csv')
df_opponents_s_per_game=pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Stats Per Game.csv')
df_opponent_totals = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/Opponent Totals.csv')
In [26]:
df_opponent_stats_100.shape
Out[26]:
(1402, 28)
In [27]:
df_opponents_s_per_game.shape
Out[27]:
(1845, 28)
In [28]:
df_opponent_totals.shape
Out[28]:
(1845, 28)
In [29]:
df_opponent_stats_100.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1402 entries, 0 to 1401
Data columns (total 28 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   SEASON                                               1402 non-null   int64  
 1   LEAGUE                                               1402 non-null   object 
 2   TEAM_NAME                                            1402 non-null   object 
 3   TEAM_ABBREVIATION                                    1402 non-null   object 
 4   PLAYOFFS                                             1402 non-null   bool   
 5   GAMES                                                1402 non-null   int64  
 6   MINUTES_PLAYED                                       1402 non-null   int64  
 7   OPPONENT_FIELD_GOALS_PER_100_POSSESION               1402 non-null   float64
 8   OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION     1402 non-null   float64
 9   OPPONENT_FIELD_GOALS_PERCENTAGE                      1402 non-null   float64
 10  OPPONENT_THREE_POINTERS_PER_100_POSSESION            1283 non-null   float64
 11  OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION  1283 non-null   float64
 12  OPPONENT_THREE_POINTERS_PERCENTAGE                   1283 non-null   float64
 13  OPPONENT_TWO_POINTER_PER_100_POSSESION               1402 non-null   float64
 14  OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION    1402 non-null   float64
 15  OPPONENT_TWO_POINTERS_PERCENTAGE                     1402 non-null   float64
 16  OPPONENT_FREE_THROWS_PER_100_POSSESION               1402 non-null   float64
 17  OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION     1402 non-null   float64
 18  OPPONENT_FREE_THROWS_PERCENTAGE                      1402 non-null   float64
 19  OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION         1402 non-null   float64
 20  OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION         1402 non-null   float64
 21  OPPONENT_TOTAL_REBOUND_PER_100_POSSESION             1402 non-null   float64
 22  OPPONENT_ASSIST_PER_100_POSSESION                    1402 non-null   float64
 23  OPPONENT_STEAL_PER_100_POSSESION                     1402 non-null   float64
 24  OPPONENT_STEAL_PER_100_POSSESION.1                   1402 non-null   float64
 25  OPPONENT_TURNOVER_PER_100_POSSESION                  1402 non-null   float64
 26  OPPONENT_PERSONAL_FOUL_PER_100_POSSESION             1402 non-null   float64
 27  OPPONENT_POINTS_PER_100_POSSESION                    1402 non-null   float64
dtypes: bool(1), float64(21), int64(3), object(3)
memory usage: 297.2+ KB
In [30]:
df_opponents_s_per_game.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 28 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   SEASON                                      1845 non-null   int64  
 1   LEAGUE                                      1845 non-null   object 
 2   TEAM                                        1845 non-null   object 
 3   TEAM_ABBREAVIATION                          1758 non-null   object 
 4   PLAYOFFS                                    1845 non-null   bool   
 5   GAMES                                       1844 non-null   float64
 6   MINUTES_PLAYED_PER_GAME                     1655 non-null   float64
 7   OPPONENTS_FIELD_GOALS_PER_GAME              1605 non-null   float64
 8   OPPONENTS_FIELD_GOALS_Attempted_per_game    1605 non-null   float64
 9   OPPONENT_FIELD_GOALS_PERCENTAGE             1605 non-null   float64
 10  OPPONENTS_THREE_POINTERS_PER_GAME           1402 non-null   float64
 11  PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME  1402 non-null   float64
 12  OPPONENTS_THREE_POINTERS_PERCENTAGE         1402 non-null   float64
 13  OPPONENTS_TWO_POINTERS_PER_GAME             1605 non-null   float64
 14  OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME   1605 non-null   float64
 15  OPPONENTS_TWO_POINTERS_PERCENTAGE           1605 non-null   float64
 16  OPPONENTS_FREE_THROWS_PER_GAME              1605 non-null   float64
 17  OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME    1605 non-null   float64
 18  OPPONENTS_FREE_THROWS_PERCENTAGE            1605 non-null   float64
 19  OPPONENTS_OFFENSIVE_REBOUND_PER_GAME        1482 non-null   float64
 20  OPPONENTS_DEFENSIVE_REBOUND_PER_GAME        1482 non-null   float64
 21  OPPONENTS_TOTAL_REBOUND_PER_GAME            1601 non-null   float64
 22  OPPONENTS_ASSIST_PER_GAME                   1605 non-null   float64
 23  OPPONENTS_STEAL_PER_GAME                    1456 non-null   float64
 24  OPPONENTSBLOCKS_PER_GAME                    1456 non-null   float64
 25  OPPONENTS_TURNOVER_PER_GAME                 1527 non-null   float64
 26  OPPONENTS_PERSONAL_FOUL_PER_GAME            1605 non-null   float64
 27  OPPONENTS_POINTS_PER_GAME                   1844 non-null   float64
dtypes: bool(1), float64(23), int64(1), object(3)
memory usage: 391.1+ KB
In [31]:
df_opponent_totals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 28 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   SEASON                              1845 non-null   int64  
 1   LEAGUE                              1845 non-null   object 
 2   TEAM                                1845 non-null   object 
 3   ABBREVIATION                        1758 non-null   object 
 4   PLAYOFFS                            1845 non-null   bool   
 5   GAMES                               1844 non-null   float64
 6   MINUTES_PLAYED                      1655 non-null   float64
 7   OPPONENT_FIELD_GOALS                1605 non-null   float64
 8   OPPONENT_FIELD_GOALS_ATTEMPTED      1605 non-null   float64
 9   OPPPONENT_FIELD_GOALS_PERCENTAGE    1605 non-null   float64
 10  OPPONENT_THREE_POINTERS             1402 non-null   float64
 11  OPPONENT_THREE_POINTERS_ATTEMPTED   1402 non-null   float64
 12  OPPONENT_THREE_POINTERS_PERCENTAGE  1402 non-null   float64
 13  OPPONENT_TWO_POINTERS               1605 non-null   float64
 14  OPPONENT_TWO_POINTERS_ATTEMPTED     1605 non-null   float64
 15  OPPONENT_TWO_POINTERS_PERCENTAGE    1605 non-null   float64
 16  OPPONENT_FREE_THROWS                1605 non-null   float64
 17  OPPONENT_FREE_THROWS_ATTEMPTED      1605 non-null   float64
 18  OPPONENT_FREE_THROW_PERCENTAGE      1605 non-null   float64
 19  OPPONENTS_OFFENSIVE_REBOUND         1482 non-null   float64
 20  OPPONENTS_DEFENSIVE_REBOUND         1482 non-null   float64
 21  OPPONENTS_TOTAL_REBOUND             1601 non-null   float64
 22  OPPONENTS_ASSIST                    1605 non-null   float64
 23  OPPONENTS_STEALS                    1456 non-null   float64
 24  OPPONENTS_BLOCKS                    1456 non-null   float64
 25  OPPONENTS_TURNOVERS                 1527 non-null   float64
 26  OPPONENTS_PERSONAL_FOUL             1605 non-null   float64
 27  OPPONENTS_POINTS                    1844 non-null   float64
dtypes: bool(1), float64(23), int64(1), object(3)
memory usage: 391.1+ KB
In [32]:
del df_opponent_stats_100['TEAM_NAME']
In [33]:
del df_opponent_totals['ABBREVIATION']
In [34]:
def common_columns(dataframes):
    # Create a dictionary to count the occurrences of each column
    column_counts = {}

    # Iterate over all dataframes
    for df in dataframes:
        # Iterate over all columns in the current dataframe
        for col in df.columns:
            # If the column is already in the dictionary, increment its count
            if col in column_counts:
                column_counts[col] += 1
            # Otherwise, add the column to the dictionary with a count of 1
            else:
                column_counts[col] = 1

    # Return a list of columns that appear at least twice
    return [col for col, count in column_counts.items() if count >= 2]
In [35]:
dfs2=[df_opponent_stats_100,df_opponents_s_per_game,df_opponent_totals]
# dfs is your list of dataframes
common_cols2 = common_columns(dfs2)
print(common_cols2)
['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'MINUTES_PLAYED', 'OPPONENT_FIELD_GOALS_PERCENTAGE', 'OPPONENT_THREE_POINTERS_PERCENTAGE', 'OPPONENT_TWO_POINTERS_PERCENTAGE', 'TEAM']
In [36]:
m1 = pd.merge(df_opponent_stats_100, df_opponents_s_per_game, on =['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'OPPONENT_FIELD_GOALS_PERCENTAGE'], how ='left')
opponent_team_df = pd.merge(m1, df_opponent_totals, on =['SEASON', 'LEAGUE', 'PLAYOFFS', 'GAMES', 'MINUTES_PLAYED',  'OPPONENT_THREE_POINTERS_PERCENTAGE', 'OPPONENT_TWO_POINTERS_PERCENTAGE', 'TEAM'], how ='left')

opponent_team_df.describe()
Out[36]:
SEASON GAMES MINUTES_PLAYED OPPONENT_FIELD_GOALS_PER_100_POSSESION OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION OPPONENT_FIELD_GOALS_PERCENTAGE OPPONENT_THREE_POINTERS_PER_100_POSSESION OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION OPPONENT_THREE_POINTERS_PERCENTAGE OPPONENT_TWO_POINTER_PER_100_POSSESION ... OPPONENT_FREE_THROW_PERCENTAGE OPPONENTS_OFFENSIVE_REBOUND OPPONENTS_DEFENSIVE_REBOUND OPPONENTS_TOTAL_REBOUND OPPONENTS_ASSIST OPPONENTS_STEALS OPPONENTS_BLOCKS OPPONENTS_TURNOVERS OPPONENTS_PERSONAL_FOUL OPPONENTS_POINTS
count 1817.000000 1817.000000 1817.000000 1817.000000 1817.000000 1817.000000 1673.000000 1673.000000 1673.000000 1817.000000 ... 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000 1406.000000
mean 2000.765548 80.104568 19360.209136 40.553990 87.672262 0.462651 6.096533 17.246742 0.338803 34.940451 ... 0.757070 999.862020 2463.342817 3463.204836 1871.770270 649.174964 401.936700 1268.086771 1786.110953 8281.305121
std 14.414455 6.636741 1605.170296 1.894964 2.328820 0.019595 3.698137 9.985400 0.038049 4.564443 ... 0.017028 195.102932 277.961134 381.363989 256.438149 103.089005 70.476009 233.622830 264.816083 927.387201
min 1974.000000 11.000000 2640.000000 34.900000 78.000000 0.402000 0.200000 1.500000 0.115000 23.700000 ... 0.694000 176.000000 360.000000 536.000000 255.000000 100.000000 53.000000 185.000000 273.000000 1138.000000
25% 1989.000000 82.000000 19755.000000 39.200000 86.200000 0.450000 2.800000 8.400000 0.329000 31.300000 ... 0.746000 868.000000 2345.000000 3350.000000 1736.250000 589.250000 357.000000 1130.000000 1653.000000 7942.000000
50% 2002.000000 82.000000 19805.000000 40.600000 87.800000 0.462000 6.100000 17.300000 0.349000 33.900000 ... 0.757000 1000.500000 2479.500000 3497.000000 1892.000000 648.000000 403.000000 1246.000000 1808.000000 8421.500000
75% 2013.000000 82.000000 19855.000000 41.800000 89.200000 0.475000 8.200000 22.800000 0.363000 39.300000 ... 0.768000 1146.000000 2631.750000 3661.750000 2046.750000 714.000000 446.750000 1404.000000 1965.750000 8821.000000
max 2024.000000 84.000000 20460.000000 47.000000 95.500000 0.536000 14.700000 40.500000 0.411000 46.300000 ... 0.816000 1662.000000 3154.000000 4750.000000 2537.000000 955.000000 654.000000 1980.000000 2453.000000 10723.000000

8 rows × 64 columns

In [37]:
opponent_team_df.shape
Out[37]:
(1817, 69)
In [38]:
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1817 entries, 0 to 1816
Data columns (total 69 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   SEASON                                               1817 non-null   int64  
 1   LEAGUE                                               1817 non-null   object 
 2   TEAM_ABBREVIATION                                    1817 non-null   object 
 3   PLAYOFFS                                             1817 non-null   bool   
 4   GAMES                                                1817 non-null   int64  
 5   MINUTES_PLAYED                                       1817 non-null   int64  
 6   OPPONENT_FIELD_GOALS_PER_100_POSSESION               1817 non-null   float64
 7   OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION     1817 non-null   float64
 8   OPPONENT_FIELD_GOALS_PERCENTAGE                      1817 non-null   float64
 9   OPPONENT_THREE_POINTERS_PER_100_POSSESION            1673 non-null   float64
 10  OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION  1673 non-null   float64
 11  OPPONENT_THREE_POINTERS_PERCENTAGE                   1673 non-null   float64
 12  OPPONENT_TWO_POINTER_PER_100_POSSESION               1817 non-null   float64
 13  OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION    1817 non-null   float64
 14  OPPONENT_TWO_POINTERS_PERCENTAGE                     1817 non-null   float64
 15  OPPONENT_FREE_THROWS_PER_100_POSSESION               1817 non-null   float64
 16  OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION     1817 non-null   float64
 17  OPPONENT_FREE_THROWS_PERCENTAGE                      1817 non-null   float64
 18  OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION         1817 non-null   float64
 19  OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION         1817 non-null   float64
 20  OPPONENT_TOTAL_REBOUND_PER_100_POSSESION             1817 non-null   float64
 21  OPPONENT_ASSIST_PER_100_POSSESION                    1817 non-null   float64
 22  OPPONENT_STEAL_PER_100_POSSESION                     1817 non-null   float64
 23  OPPONENT_STEAL_PER_100_POSSESION.1                   1817 non-null   float64
 24  OPPONENT_TURNOVER_PER_100_POSSESION                  1817 non-null   float64
 25  OPPONENT_PERSONAL_FOUL_PER_100_POSSESION             1817 non-null   float64
 26  OPPONENT_POINTS_PER_100_POSSESION                    1817 non-null   float64
 27  TEAM                                                 1817 non-null   object 
 28  TEAM_ABBREAVIATION                                   1806 non-null   object 
 29  MINUTES_PLAYED_PER_GAME                              1817 non-null   float64
 30  OPPONENTS_FIELD_GOALS_PER_GAME                       1817 non-null   float64
 31  OPPONENTS_FIELD_GOALS_Attempted_per_game             1817 non-null   float64
 32  OPPONENTS_THREE_POINTERS_PER_GAME                    1673 non-null   float64
 33  PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME           1673 non-null   float64
 34  OPPONENTS_THREE_POINTERS_PERCENTAGE                  1673 non-null   float64
 35  OPPONENTS_TWO_POINTERS_PER_GAME                      1817 non-null   float64
 36  OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME            1817 non-null   float64
 37  OPPONENTS_TWO_POINTERS_PERCENTAGE                    1817 non-null   float64
 38  OPPONENTS_FREE_THROWS_PER_GAME                       1817 non-null   float64
 39  OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME             1817 non-null   float64
 40  OPPONENTS_FREE_THROWS_PERCENTAGE                     1817 non-null   float64
 41  OPPONENTS_OFFENSIVE_REBOUND_PER_GAME                 1817 non-null   float64
 42  OPPONENTS_DEFENSIVE_REBOUND_PER_GAME                 1817 non-null   float64
 43  OPPONENTS_TOTAL_REBOUND_PER_GAME                     1817 non-null   float64
 44  OPPONENTS_ASSIST_PER_GAME                            1817 non-null   float64
 45  OPPONENTS_STEAL_PER_GAME                             1817 non-null   float64
 46  OPPONENTSBLOCKS_PER_GAME                             1817 non-null   float64
 47  OPPONENTS_TURNOVER_PER_GAME                          1817 non-null   float64
 48  OPPONENTS_PERSONAL_FOUL_PER_GAME                     1817 non-null   float64
 49  OPPONENTS_POINTS_PER_GAME                            1817 non-null   float64
 50  OPPONENT_FIELD_GOALS                                 1406 non-null   float64
 51  OPPONENT_FIELD_GOALS_ATTEMPTED                       1406 non-null   float64
 52  OPPPONENT_FIELD_GOALS_PERCENTAGE                     1406 non-null   float64
 53  OPPONENT_THREE_POINTERS                              1283 non-null   float64
 54  OPPONENT_THREE_POINTERS_ATTEMPTED                    1283 non-null   float64
 55  OPPONENT_TWO_POINTERS                                1406 non-null   float64
 56  OPPONENT_TWO_POINTERS_ATTEMPTED                      1406 non-null   float64
 57  OPPONENT_FREE_THROWS                                 1406 non-null   float64
 58  OPPONENT_FREE_THROWS_ATTEMPTED                       1406 non-null   float64
 59  OPPONENT_FREE_THROW_PERCENTAGE                       1406 non-null   float64
 60  OPPONENTS_OFFENSIVE_REBOUND                          1406 non-null   float64
 61  OPPONENTS_DEFENSIVE_REBOUND                          1406 non-null   float64
 62  OPPONENTS_TOTAL_REBOUND                              1406 non-null   float64
 63  OPPONENTS_ASSIST                                     1406 non-null   float64
 64  OPPONENTS_STEALS                                     1406 non-null   float64
 65  OPPONENTS_BLOCKS                                     1406 non-null   float64
 66  OPPONENTS_TURNOVERS                                  1406 non-null   float64
 67  OPPONENTS_PERSONAL_FOUL                              1406 non-null   float64
 68  OPPONENTS_POINTS                                     1406 non-null   float64
dtypes: bool(1), float64(61), int64(3), object(4)
memory usage: 967.2+ KB
In [39]:
del opponent_team_df['TEAM_ABBREAVIATION']
In [40]:
opponent_team_df.columns.tolist()
Out[40]:
['SEASON',
 'LEAGUE',
 'TEAM_ABBREVIATION',
 'PLAYOFFS',
 'GAMES',
 'MINUTES_PLAYED',
 'OPPONENT_FIELD_GOALS_PER_100_POSSESION',
 'OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_FIELD_GOALS_PERCENTAGE',
 'OPPONENT_THREE_POINTERS_PER_100_POSSESION',
 'OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_THREE_POINTERS_PERCENTAGE',
 'OPPONENT_TWO_POINTER_PER_100_POSSESION',
 'OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_TWO_POINTERS_PERCENTAGE',
 'OPPONENT_FREE_THROWS_PER_100_POSSESION',
 'OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_FREE_THROWS_PERCENTAGE',
 'OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION',
 'OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION',
 'OPPONENT_TOTAL_REBOUND_PER_100_POSSESION',
 'OPPONENT_ASSIST_PER_100_POSSESION',
 'OPPONENT_STEAL_PER_100_POSSESION',
 'OPPONENT_STEAL_PER_100_POSSESION.1',
 'OPPONENT_TURNOVER_PER_100_POSSESION',
 'OPPONENT_PERSONAL_FOUL_PER_100_POSSESION',
 'OPPONENT_POINTS_PER_100_POSSESION',
 'TEAM',
 'MINUTES_PLAYED_PER_GAME',
 'OPPONENTS_FIELD_GOALS_PER_GAME',
 'OPPONENTS_FIELD_GOALS_Attempted_per_game',
 'OPPONENTS_THREE_POINTERS_PER_GAME',
 'PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME',
 'OPPONENTS_THREE_POINTERS_PERCENTAGE',
 'OPPONENTS_TWO_POINTERS_PER_GAME',
 'OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME',
 'OPPONENTS_TWO_POINTERS_PERCENTAGE',
 'OPPONENTS_FREE_THROWS_PER_GAME',
 'OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME',
 'OPPONENTS_FREE_THROWS_PERCENTAGE',
 'OPPONENTS_OFFENSIVE_REBOUND_PER_GAME',
 'OPPONENTS_DEFENSIVE_REBOUND_PER_GAME',
 'OPPONENTS_TOTAL_REBOUND_PER_GAME',
 'OPPONENTS_ASSIST_PER_GAME',
 'OPPONENTS_STEAL_PER_GAME',
 'OPPONENTSBLOCKS_PER_GAME',
 'OPPONENTS_TURNOVER_PER_GAME',
 'OPPONENTS_PERSONAL_FOUL_PER_GAME',
 'OPPONENTS_POINTS_PER_GAME',
 'OPPONENT_FIELD_GOALS',
 'OPPONENT_FIELD_GOALS_ATTEMPTED',
 'OPPPONENT_FIELD_GOALS_PERCENTAGE',
 'OPPONENT_THREE_POINTERS',
 'OPPONENT_THREE_POINTERS_ATTEMPTED',
 'OPPONENT_TWO_POINTERS',
 'OPPONENT_TWO_POINTERS_ATTEMPTED',
 'OPPONENT_FREE_THROWS',
 'OPPONENT_FREE_THROWS_ATTEMPTED',
 'OPPONENT_FREE_THROW_PERCENTAGE',
 'OPPONENTS_OFFENSIVE_REBOUND',
 'OPPONENTS_DEFENSIVE_REBOUND',
 'OPPONENTS_TOTAL_REBOUND',
 'OPPONENTS_ASSIST',
 'OPPONENTS_STEALS',
 'OPPONENTS_BLOCKS',
 'OPPONENTS_TURNOVERS',
 'OPPONENTS_PERSONAL_FOUL',
 'OPPONENTS_POINTS']
In [41]:
opponent_team_df = opponent_team_df.drop([
 'LEAGUE',
 'OPPONENT_FIELD_GOALS_PER_100_POSSESION',
 'OPPONENT_FIELD_GOALS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_FIELD_GOALS_PERCENTAGE',
 'OPPONENT_THREE_POINTERS_PER_100_POSSESION',
 'OPPONENT_THREE_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_THREE_POINTERS_PERCENTAGE',
 'OPPONENT_TWO_POINTER_PER_100_POSSESION',
 'OPPONENT_TWO_POINTERS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_TWO_POINTERS_PERCENTAGE',
 'OPPONENT_FREE_THROWS_PER_100_POSSESION',
 'OPPONENT_FREE_THROWS_ATTEMPTED_PER_100_POSSESION',
 'OPPONENT_FREE_THROWS_PERCENTAGE',
 'OPPONENT_OFFENSIVE_REBOUND_PER_100_POSSESION',
 'OPPONENT_DEFENSIVE_REBOUND_PER_100_POSSESION',
 'OPPONENT_TOTAL_REBOUND_PER_100_POSSESION',
 'OPPONENT_ASSIST_PER_100_POSSESION',
 'OPPONENT_STEAL_PER_100_POSSESION',
 'OPPONENT_STEAL_PER_100_POSSESION.1',
 'OPPONENT_TURNOVER_PER_100_POSSESION',
 'OPPONENT_PERSONAL_FOUL_PER_100_POSSESION',
 'OPPONENT_POINTS_PER_100_POSSESION',
 'MINUTES_PLAYED_PER_GAME',
 'PPONENTS_THREE_POINTERS_ATTEMPTED_PER_GAME',
 'OPPONENTS_TWO_POINTERS_ATTEMPTED_PER_GAME',
 'OPPONENTS_FREE_THROWS_ATTEMPTED_PER_GAME',
 'OPPONENTS_OFFENSIVE_REBOUND_PER_GAME',
 'OPPONENTS_DEFENSIVE_REBOUND_PER_GAME',
 'OPPONENT_THREE_POINTERS_ATTEMPTED',
 'OPPONENT_TWO_POINTERS_ATTEMPTED',
 'OPPONENT_FREE_THROW_PERCENTAGE',
 ], axis=1)
In [42]:
opponent_team_df.fillna(0, inplace=True)
In [43]:
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1817 entries, 0 to 1816
Data columns (total 37 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   SEASON                                    1817 non-null   int64  
 1   TEAM_ABBREVIATION                         1817 non-null   object 
 2   PLAYOFFS                                  1817 non-null   bool   
 3   GAMES                                     1817 non-null   int64  
 4   MINUTES_PLAYED                            1817 non-null   int64  
 5   TEAM                                      1817 non-null   object 
 6   OPPONENTS_FIELD_GOALS_PER_GAME            1817 non-null   float64
 7   OPPONENTS_FIELD_GOALS_Attempted_per_game  1817 non-null   float64
 8   OPPONENTS_THREE_POINTERS_PER_GAME         1817 non-null   float64
 9   OPPONENTS_THREE_POINTERS_PERCENTAGE       1817 non-null   float64
 10  OPPONENTS_TWO_POINTERS_PER_GAME           1817 non-null   float64
 11  OPPONENTS_TWO_POINTERS_PERCENTAGE         1817 non-null   float64
 12  OPPONENTS_FREE_THROWS_PER_GAME            1817 non-null   float64
 13  OPPONENTS_FREE_THROWS_PERCENTAGE          1817 non-null   float64
 14  OPPONENTS_TOTAL_REBOUND_PER_GAME          1817 non-null   float64
 15  OPPONENTS_ASSIST_PER_GAME                 1817 non-null   float64
 16  OPPONENTS_STEAL_PER_GAME                  1817 non-null   float64
 17  OPPONENTSBLOCKS_PER_GAME                  1817 non-null   float64
 18  OPPONENTS_TURNOVER_PER_GAME               1817 non-null   float64
 19  OPPONENTS_PERSONAL_FOUL_PER_GAME          1817 non-null   float64
 20  OPPONENTS_POINTS_PER_GAME                 1817 non-null   float64
 21  OPPONENT_FIELD_GOALS                      1817 non-null   float64
 22  OPPONENT_FIELD_GOALS_ATTEMPTED            1817 non-null   float64
 23  OPPPONENT_FIELD_GOALS_PERCENTAGE          1817 non-null   float64
 24  OPPONENT_THREE_POINTERS                   1817 non-null   float64
 25  OPPONENT_TWO_POINTERS                     1817 non-null   float64
 26  OPPONENT_FREE_THROWS                      1817 non-null   float64
 27  OPPONENT_FREE_THROWS_ATTEMPTED            1817 non-null   float64
 28  OPPONENTS_OFFENSIVE_REBOUND               1817 non-null   float64
 29  OPPONENTS_DEFENSIVE_REBOUND               1817 non-null   float64
 30  OPPONENTS_TOTAL_REBOUND                   1817 non-null   float64
 31  OPPONENTS_ASSIST                          1817 non-null   float64
 32  OPPONENTS_STEALS                          1817 non-null   float64
 33  OPPONENTS_BLOCKS                          1817 non-null   float64
 34  OPPONENTS_TURNOVERS                       1817 non-null   float64
 35  OPPONENTS_PERSONAL_FOUL                   1817 non-null   float64
 36  OPPONENTS_POINTS                          1817 non-null   float64
dtypes: bool(1), float64(31), int64(3), object(2)
memory usage: 512.9+ KB
In [44]:
opponent_team_df = opponent_team_df.rename(columns={'OPPONENTSBLOCKS_PER_GAME': 'OPPONENTS_BLOCKS_PER_GAME'})
In [45]:
opponent_team_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1817 entries, 0 to 1816
Data columns (total 37 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   SEASON                                    1817 non-null   int64  
 1   TEAM_ABBREVIATION                         1817 non-null   object 
 2   PLAYOFFS                                  1817 non-null   bool   
 3   GAMES                                     1817 non-null   int64  
 4   MINUTES_PLAYED                            1817 non-null   int64  
 5   TEAM                                      1817 non-null   object 
 6   OPPONENTS_FIELD_GOALS_PER_GAME            1817 non-null   float64
 7   OPPONENTS_FIELD_GOALS_Attempted_per_game  1817 non-null   float64
 8   OPPONENTS_THREE_POINTERS_PER_GAME         1817 non-null   float64
 9   OPPONENTS_THREE_POINTERS_PERCENTAGE       1817 non-null   float64
 10  OPPONENTS_TWO_POINTERS_PER_GAME           1817 non-null   float64
 11  OPPONENTS_TWO_POINTERS_PERCENTAGE         1817 non-null   float64
 12  OPPONENTS_FREE_THROWS_PER_GAME            1817 non-null   float64
 13  OPPONENTS_FREE_THROWS_PERCENTAGE          1817 non-null   float64
 14  OPPONENTS_TOTAL_REBOUND_PER_GAME          1817 non-null   float64
 15  OPPONENTS_ASSIST_PER_GAME                 1817 non-null   float64
 16  OPPONENTS_STEAL_PER_GAME                  1817 non-null   float64
 17  OPPONENTS_BLOCKS_PER_GAME                 1817 non-null   float64
 18  OPPONENTS_TURNOVER_PER_GAME               1817 non-null   float64
 19  OPPONENTS_PERSONAL_FOUL_PER_GAME          1817 non-null   float64
 20  OPPONENTS_POINTS_PER_GAME                 1817 non-null   float64
 21  OPPONENT_FIELD_GOALS                      1817 non-null   float64
 22  OPPONENT_FIELD_GOALS_ATTEMPTED            1817 non-null   float64
 23  OPPPONENT_FIELD_GOALS_PERCENTAGE          1817 non-null   float64
 24  OPPONENT_THREE_POINTERS                   1817 non-null   float64
 25  OPPONENT_TWO_POINTERS                     1817 non-null   float64
 26  OPPONENT_FREE_THROWS                      1817 non-null   float64
 27  OPPONENT_FREE_THROWS_ATTEMPTED            1817 non-null   float64
 28  OPPONENTS_OFFENSIVE_REBOUND               1817 non-null   float64
 29  OPPONENTS_DEFENSIVE_REBOUND               1817 non-null   float64
 30  OPPONENTS_TOTAL_REBOUND                   1817 non-null   float64
 31  OPPONENTS_ASSIST                          1817 non-null   float64
 32  OPPONENTS_STEALS                          1817 non-null   float64
 33  OPPONENTS_BLOCKS                          1817 non-null   float64
 34  OPPONENTS_TURNOVERS                       1817 non-null   float64
 35  OPPONENTS_PERSONAL_FOUL                   1817 non-null   float64
 36  OPPONENTS_POINTS                          1817 non-null   float64
dtypes: bool(1), float64(31), int64(3), object(2)
memory usage: 512.9+ KB
In [46]:
opponent_team_df['SEASON'].unique()
Out[46]:
array([2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014,
       2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003,
       2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992,
       1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981,
       1980, 1979, 1978, 1977, 1976, 1975, 1974])
In [47]:
opponent_team_df['TEAM'].unique()
Out[47]:
array(['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets',
       'Chicago Bulls', 'Charlotte Hornets', 'Cleveland Cavaliers',
       'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons',
       'Portland Trail Blazers', 'Golden State Warriors',
       'Houston Rockets', 'Indiana Pacers', 'Los Angeles Clippers',
       'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat',
       'Milwaukee Bucks', 'Minnesota Timberwolves',
       'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder',
       'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns',
       'Sacramento Kings', 'San Antonio Spurs', 'Toronto Raptors',
       'Utah Jazz', 'Washington Wizards', 'League Average',
       'Charlotte Bobcats', 'New Orleans Hornets', 'New Jersey Nets',
       'Seattle SuperSonics', 'New Orleans/Oklahoma City Hornets',
       'Vancouver Grizzlies', 'Washington Bullets', 'Kansas City Kings',
       'San Diego Clippers', 'New Orleans Jazz', 'Buffalo Braves',
       'New York Nets', 'Kentucky Colonels', 'San Diego Sails',
       'Spirits of St. Louis', 'Utah Stars', 'Virginia Squires',
       'Kansas City-Omaha Kings', 'Memphis Sounds',
       'San Diego Conquistadors', 'Capital Bullets', 'Carolina Cougars',
       'Denver Rockets', 'Memphis Tams'], dtype=object)

Although this DataFrame didn't significantly influence our final model, we took a moment to investigate whether there was a substantial variance in the number of injured players per team. We hypothesized that this might impact a team's overall performance and, consequently, their likelihood of winning. It's worth noting that the available data only covers a decade (2010-2020).

In [48]:
injuries_df = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/injuries_2010-2020.csv')
In [49]:
injuries_df.shape
Out[49]:
(27105, 5)
In [50]:
injuries_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27105 entries, 0 to 27104
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   DATE          27105 non-null  object
 1   TEAM          27103 non-null  object
 2   ACQUIRED      9542 non-null   object
 3   RELINQUISHED  17560 non-null  object
 4   NOTES         27105 non-null  object
dtypes: object(5)
memory usage: 1.0+ MB
In [51]:
injuries_df['TEAM'].unique()
Out[51]:
array(['Bulls', 'Pistons', 'Blazers', 'Nets', 'Nuggets', 'Bucks', 'Kings',
       'Bobcats', 'Warriors', 'Suns', 'Heat', 'Thunder', 'Timberwolves',
       'Celtics', 'Lakers', 'Rockets', '76ers', 'Cavaliers', 'Clippers',
       'Grizzlies', 'Hawks', 'Hornets', 'Jazz', 'Knicks', 'Mavericks',
       'Pacers', 'Raptors', 'Spurs', 'Magic', 'Wizards', 'Pelicans', nan,
       'Bullets'], dtype=object)
In [52]:
injuries_df['TEAM'].value_counts()
Out[52]:
TEAM
Spurs           1163
Bucks           1068
Warriors        1060
Rockets         1058
Raptors         1044
Celtics         1040
Nets            1024
Heat            1023
Cavaliers       1001
Mavericks        992
Hawks            975
Nuggets          966
Lakers           959
Knicks           943
76ers            910
Wizards          875
Grizzlies        875
Timberwolves     860
Jazz             841
Magic            834
Pacers           831
Bulls            791
Suns             733
Kings            728
Hornets          719
Clippers         718
Thunder          717
Pistons          714
Blazers          695
Pelicans         576
Bobcats          369
Bullets            1
Name: count, dtype: int64
In [53]:
pd.crosstab(index=injuries_df['TEAM'], columns="count")
Out[53]:
col_0 count
TEAM
76ers 910
Blazers 695
Bobcats 369
Bucks 1068
Bullets 1
Bulls 791
Cavaliers 1001
Celtics 1040
Clippers 718
Grizzlies 875
Hawks 975
Heat 1023
Hornets 719
Jazz 841
Kings 728
Knicks 943
Lakers 959
Magic 834
Mavericks 992
Nets 1024
Nuggets 966
Pacers 831
Pelicans 576
Pistons 714
Raptors 1044
Rockets 1058
Spurs 1163
Suns 733
Thunder 717
Timberwolves 860
Warriors 1060
Wizards 875

A visual representation of the distribution of wounded players across teams.

In [54]:
injuries_df['TEAM'].hist()
plt.xlabel('Team Name')
plt.ylabel('Number of Injuries')
plt.title('Histogram of Number of Injuries by Team')
plt.xticks(rotation='vertical')
plt.show()

As we delve into the draft_history dataset with df_draft_history.info(), we’re getting a glimpse of the different chapters of this story. Each column represents a different aspect of the draft history,including the number of non-null entries in each column and the data type of each column. The df_draft_history.head() It gives us a sneak peek into the data.

In [55]:
df_draft_history = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/draft_history.csv')
df_draft_combine_stats =pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/draft_combine_stats.csv')
In [56]:
df_draft_history.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7990 entries, 0 to 7989
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   PERSON_ID            7990 non-null   int64 
 1   PLAYER_NAME          7990 non-null   object
 2   SEASON               7990 non-null   int64 
 3   ROUND_NUMBER         7990 non-null   int64 
 4   ROUND_PICK           7990 non-null   int64 
 5   OVERALL_PICK         7990 non-null   int64 
 6   DRAFT_TYPE           7990 non-null   object
 7   TEAM_ID              7990 non-null   int64 
 8   TEAM_CITY            7990 non-null   object
 9   TEAM_NAME            7990 non-null   object
 10  TEAM_ABBREVIATION    7990 non-null   object
 11  ORGANIZATION         7971 non-null   object
 12  ORGANIZATION_TYPE    7971 non-null   object
 13  PLAYER_PROFILE_FLAG  7990 non-null   int64 
dtypes: int64(7), object(7)
memory usage: 874.0+ KB
In [57]:
df_draft_combine_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1202 entries, 0 to 1201
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   SEASON                            1202 non-null   int64  
 1   PLAYER_ID                         1202 non-null   int64  
 2   FIRST_NAME                        1202 non-null   object 
 3   LAST_NAME                         1202 non-null   object 
 4   PLAYER_NAME                       1202 non-null   object 
 5   POSITION                          1197 non-null   object 
 6   HEIGHT_WITHOUT_SHOEAS             1153 non-null   float64
 7   HEIGHT_WITHOUT_SHOES_FEET_INCHES  1153 non-null   object 
 8   HEIGHT_WITH_SHOES                 1008 non-null   float64
 9   HEIGHT_WITH_SHOES_FEET_INCHES     1008 non-null   object 
 10  WEIGHT                            1152 non-null   float64
 11  WINGSPAN                          1153 non-null   float64
 12  WINGSPAN_FEET_INCHES              1153 non-null   object 
 13  STANDING_REACH                    1152 non-null   float64
 14  STANDING_REACH_FT_INCHES          1152 non-null   object 
 15  BODY_FAT_PERCENTAGE               1003 non-null   float64
 16  HAND_LENGHT                       719 non-null    float64
 17  HAND_WIDTH                        719 non-null    float64
 18  STANDING_VERTICAL_LEAP            1017 non-null   float64
 19  MAX_VERTICAL_LEAP                 1017 non-null   float64
 20  LANE_AGILITY_TIME                 1008 non-null   float64
 21  MODIFIED_LANE_AGILITY_TIME        411 non-null    float64
 22  THREE_QUARTER_SPRINT              1012 non-null   float64
 23  BENCH_PRESSES                     808 non-null    float64
 24  SPOT_FIFTEEN_CORNER_LEFT          74 non-null     object 
 25  SPOT_FIFTEEN_BREAK_LEFT           76 non-null     object 
 26  SPOT_FIFTEEN_TOP_KEY              76 non-null     object 
 27  SPOT_FIFTEEN_BREAK_RIGHT          76 non-null     object 
 28  SPOT_FIFTEEN_CORNER_RIGHT         76 non-null     object 
 29  SPOT_COLLEGE_CORNER_LEFT          245 non-null    object 
 30  SPOT_COLLEGE_BREAK_LEFT           166 non-null    object 
 31  SPOT_COLLEGE_TOP_KEY              166 non-null    object 
 32  SPOT_COLLEGE_BREAK_RIGHT          166 non-null    object 
 33  SPOT_COLLEGE_CORNER_RIGHT         166 non-null    object 
 34  SPOT_NBA_CORNER_LEFT              217 non-null    object 
 35  SPOT_NBA_BREAK_LEFT               217 non-null    object 
 36  SPOT_NBA_TOP_KEY                  217 non-null    object 
 37  SPOT_NBA_BREAK_RIGHT              217 non-null    object 
 38  SPOT_NBA_CORNER_RIGHT             217 non-null    object 
 39  OFF_DRIBBLE_FIFTEEN_BREAK_LEFT    166 non-null    object 
 40  OFF_DRIBBLE_FIFTEEN_TOP_KEY       166 non-null    object 
 41  OFF_DRIBBLE_FIFTEEN_BREAK_RIGHT   166 non-null    object 
 42  OFF_DRIBBLE_COLLEGE_BREAK_LEFT    110 non-null    object 
 43  OFF_DRIBBLE_COLLEGE_TOP-KEY       31 non-null     object 
 44  OFF_DRIBBLE_COLLEGE_BREAK_RIGHT   31 non-null     object 
 45  ON_MOVE_FIFTEEN                   148 non-null    object 
 46  ON_MOVE_COLLEGE                   116 non-null    object 
dtypes: float64(14), int64(2), object(31)
memory usage: 441.5+ KB
In [58]:
df_draft_history.head()
Out[58]:
PERSON_ID PLAYER_NAME SEASON ROUND_NUMBER ROUND_PICK OVERALL_PICK DRAFT_TYPE TEAM_ID TEAM_CITY TEAM_NAME TEAM_ABBREVIATION ORGANIZATION ORGANIZATION_TYPE PLAYER_PROFILE_FLAG
0 79299 Clifton McNeeley 1947 1 1 1 Draft 1610610031 Pittsburgh Ironmen PIT Texas-El Paso College/University 0
1 78109 Glen Selbo 1947 1 2 2 Draft 1610610035 Toronto Huskies HUS Wisconsin College/University 1
2 76649 Eddie Ehlers 1947 1 3 3 Draft 1610612738 Boston Celtics BOS Purdue College/University 1
3 79302 Walt Dropo 1947 1 4 4 Draft 1610610032 Providence Steamrollers PRO Connecticut College/University 0
4 77048 Dick Holub 1947 1 5 5 Draft 1610612752 New York Knicks NYK Long Island-Brooklyn College/University 1
In [59]:
df_draft_combine_stats.head()
Out[59]:
SEASON PLAYER_ID FIRST_NAME LAST_NAME PLAYER_NAME POSITION HEIGHT_WITHOUT_SHOEAS HEIGHT_WITHOUT_SHOES_FEET_INCHES HEIGHT_WITH_SHOES HEIGHT_WITH_SHOES_FEET_INCHES ... SPOT_NBA_BREAK_RIGHT SPOT_NBA_CORNER_RIGHT OFF_DRIBBLE_FIFTEEN_BREAK_LEFT OFF_DRIBBLE_FIFTEEN_TOP_KEY OFF_DRIBBLE_FIFTEEN_BREAK_RIGHT OFF_DRIBBLE_COLLEGE_BREAK_LEFT OFF_DRIBBLE_COLLEGE_TOP-KEY OFF_DRIBBLE_COLLEGE_BREAK_RIGHT ON_MOVE_FIFTEEN ON_MOVE_COLLEGE
0 2001 12033 Adam Allenspach Adam Allenspach C 83.50 6' 11.5'' NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2001 2240 Gilbert Arenas Gilbert Arenas SG 74.25 6' 2.25'' NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2001 2220 Brandon Armstrong Brandon Armstrong SG 75.50 6' 3.5'' NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2001 2203 Shane Battier Shane Battier SF-PF 80.25 6' 8.25'' NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2001 12034 Cookie Belcher Cookie Belcher SG-PG 75.00 6' 3'' NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 47 columns

Now, the plot thickens. This line is like a leaderboard, ranking the top 20 organizations based on their frequency in the draft history. It’s a testament to their consistent presence and impact in the drafts.

In [60]:
top_20_organizations = df_draft_history['ORGANIZATION'].value_counts().head(20)
print(top_20_organizations)
ORGANIZATION
Kentucky                  137
California-Los Angeles    120
North Carolina            115
Duke                      105
Kansas                     87
Indiana                    77
Arizona                    74
Louisville                 73
Michigan                   73
Maryland                   72
Illinois                   71
Syracuse                   69
North Carolina State       66
Notre Dame                 63
Southern California        62
Villanova                  61
St. John's (NY)            61
Washington                 60
Michigan State             60
Marquette                  58
Name: count, dtype: int64

But what about the underdogs? df_draft_history['ORGANIZATION'].value_counts().tail(20) gives us insight into the organizations that have had less representation in the drafts. This could be the beginning of their journey to rise through the ranks.

In [61]:
df_draft_history['ORGANIZATION'].value_counts().tail(20)
Out[61]:
ORGANIZATION
West Alabama                  1
Bloomsburg                    1
Southern Utah                 1
Juvecaserta Basket (Italy)    1
Calgary (CAN)                 1
Longwood                      1
Eastern Oregon                1
Florida International         1
Massachusetts Boston          1
California-Davis              1
Hellenic                      1
Lawrence                      1
Alabama A&M                   1
Pratt                         1
Williams                      1
Caltech                       1
McMurry                       1
York (CAN)                    1
Indiana Wesleyan              1
Belhaven                      1
Name: count, dtype: int64

The print(df_draft_history['OVERALL_PICK'].value_counts()) command is like a roll call, tallying up the frequency of each overall pick number. It’s a snapshot of the distribution of picks across the drafts.

In [62]:
print(df_draft_history['OVERALL_PICK'].value_counts())
OVERALL_PICK
0      655
1       74
2       74
3       74
4       74
      ... 
232      2
231      2
230      2
238      1
239      1
Name: count, Length: 240, dtype: int64

Next, we turn our attention to the top 20 organizations This line is calculating the average overall pick for each of these organizations, shedding light on their typical standing in the drafts.

The bar graph that follows paints a vivid picture of this data. It’s a visual representation of the average overall pick by the top 20 organizations, allowing us to easily compare and contrast their performances.

In [63]:
top_20_orgs = df_draft_history['ORGANIZATION'].value_counts().index[:20]


average_pick = df_draft_history[df_draft_history['ORGANIZATION'].isin(top_20_orgs)].groupby('ORGANIZATION')['OVERALL_PICK'].mean()

plt.figure(figsize=(10,8))
average_pick.plot(kind='bar')
plt.xlabel('Organization')
plt.ylabel('Average Overall Pick')
plt.title('Bar Graph of Average Overall Pick by Top 20 Organizations')
plt.xticks(rotation=90)
plt.show()

Finally, we zoom in on the coveted first overall pick. This line is spotlighting the entries where the overall pick was 1, the pinnacle of achievement in the drafts.

The pie chart at the end of your code is the grand finale. It showcases the top 10 organizations that have had the honor of the first overall pick the most times, with all other organizations grouped into an ‘Other’ category. It’s a tribute to their success and a testament to their dominance in the drafts.

This could lead to further questions and investigations, such as looking into why those organizations have those patterns (e.g., more successful scouting, better player development, etc.).

In [64]:
df_overall_pick_1 = df_draft_history[df_draft_history['OVERALL_PICK'] == 1]

org_counts = df_overall_pick_1['ORGANIZATION'].value_counts()

# Get the top 20 organizations
top_20_orgs_counts = org_counts.nlargest(10)

# Add a new category for all other organizations
top_20_orgs_counts['Other'] = org_counts[10:].sum()

plt.figure(figsize=(10,8))  # Adjust the size as necessary
plt.pie(top_20_orgs_counts, labels=top_20_orgs_counts.index, autopct='%1.1f%%')
plt.title('Pie Chart of Top 20 Organizations with Overall Pick #1 (and Others)')
plt.show()
In [65]:
all_star = pd.read_csv('/content/drive/MyDrive/FINAL PROJECT CSVs/All-Star Selections.csv')
In [66]:
all_star.shape
Out[66]:
(2000, 5)
In [67]:
all_star.describe()
Out[67]:
SEASON
count 2000.000000
mean 1987.016000
std 20.730307
min 1951.000000
25% 1970.000000
50% 1985.000000
75% 2005.250000
max 2024.000000
In [68]:
all_star.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   PLAYER    2000 non-null   object
 1   TEAM      2000 non-null   object
 2   LEAGUE    2000 non-null   object
 3   SEASON    2000 non-null   int64 
 4   REPLACED  2000 non-null   bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 64.6+ KB
In [69]:
all_star.head()
Out[69]:
PLAYER TEAM LEAGUE SEASON REPLACED
0 Bam Adebayo East NBA 2024 False
1 Giannis Antetokounmpo East NBA 2024 False
2 Paolo Banchero East NBA 2024 False
3 Scottie Barnes East NBA 2024 False
4 Jaylen Brown East NBA 2024 False
In [70]:
all_star.tail()
Out[70]:
PLAYER TEAM LEAGUE SEASON REPLACED
1995 Alex Groza West NBA 1951 False
1996 George Mikan West NBA 1951 False
1997 Vern Mikkelsen West NBA 1951 False
1998 Jim Pollard West NBA 1951 False
1999 Fred Schaus West NBA 1951 False
In [71]:
all_star['TEAM'].value_counts()
Out[71]:
TEAM
East            909
West            907
Team LeBron      67
Team Durant      29
Team Giannis     26
LeBron           14
Giannis          13
Team Stephen     12
All Stars        12
Denver           11
Name: count, dtype: int64
In [72]:
all_star = all_star[all_star['TEAM'] != 'Denver']
In [73]:
all_star['TEAM'].value_counts()
Out[73]:
TEAM
East            909
West            907
Team LeBron      67
Team Durant      29
Team Giannis     26
LeBron           14
Giannis          13
Team Stephen     12
All Stars        12
Name: count, dtype: int64
In [74]:
all_star['PLAYER'].value_counts()
Out[74]:
PLAYER
LeBron James           20
Kareem Abdul-Jabbar    19
Kobe Bryant            18
Julius Erving          16
Tim Duncan             15
                       ..
Michael Redd            1
Kenyon Martin           1
Jamaal Magloire         1
Metta World Peace       1
Fred Schaus             1
Name: count, Length: 520, dtype: int64

The bar plot is a visual representation of the player counts, allowing us to easily compare and contrast their performances. It’s like a hall of fame, honoring the players who have made it to the All-Star game the most times. This could lead to further questions and investigations, such as looking into why those players are frequently selected (e.g., outstanding performance, popularity, etc.).

In [75]:
team_counts = all_star['PLAYER'].value_counts().sort_values(ascending=False).head(50)

plt.figure(figsize=(10, 5))
#plt.pie(team_counts, labels=team_counts.index, autopct='%1.1f%%')
plt.bar(team_counts.index, team_counts.values, color='b')
plt.xlabel('Player Name')
plt.ylabel('Number of Occurrences')
plt.title('All-Star Appearances')
plt.xticks(rotation='vertical')

plt.show()

We'll now deal with teh Teams dataframes

In [76]:
#Importing all dataframes for the teams that will be needed to create a unified team dataframe
teams_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/teams.csv") #ONLY HAS BASIC INFO
team_totals_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Totals.csv")
team_sum_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Summaries.csv")
team_statPG_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/Team Stats Per Game.csv")
In [77]:
print(team_sum_df.shape)
(1845, 31)
In [78]:
print(team_sum_df.notnull().sum())
SEASON                                   1845
LEAGUE                                   1845
TEAM                                     1845
ABBREVIATION                             1758
PLAYOFFS                                 1845
AGE                                      1781
W                                        1757
L                                        1757
PW                                       1844
PL                                       1844
MARGIN_OF_VICTORY                        1844
STRENGTH_OF_SCHEDULE                     1844
SIMPLE_RATING_SYSTEM                     1844
OFFENSIVE_RATING                         1792
DEFENSIVE_RATING                         1792
NET_RATING                               1709
PACE                                     1792
FREE_THROW_Rate                          1844
THREE_POINTER_ATTEMPT_RATE               1402
TRUE_SHOOTING_PERCENTAGE                 1844
EFFECTIEV_FIELD_GOAL_PERCENTAGE          1844
TURNOVER_PERCENTAGE                      1527
OFFENSIVE_REBOUND_PERCENTAGE             1479
FREE_THROWS_PER_FIELD_GOAL_ATTEMPT       1844
OPPONENT_EFFECTIVE_FIELD_GAL             1581
OPPONENT_FREE_THROW_PERCENTAGE           1527
OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE    1479
OPPONENT_FREE_THROW_PERCENTAGE.1         1581
ARENA                                    1757
ATTEND                                   1360
ATTEND_G                                  967
dtype: int64
In [79]:
# Do a left merge on the dataframes on SEASON, LEAGUE, and TEAM columns
# This is done to assert all dataframes have a corresponding win and loss column
team_totals_df = pd.merge(team_totals_df, team_sum_df[['SEASON', 'LEAGUE', 'TEAM', 'W', 'L']], on=['SEASON', 'LEAGUE', 'TEAM'], how='left')

# Display the merged dataframe
print(team_totals_df.notnull().sum())
SEASON                       1845
LEAGUE                       1845
TEAM                         1845
ABBREVIATION                 1758
PLAYOFFS                     1845
GAMES                        1844
MINUTES_PLAYED               1655
FIELD_GOALS_MADE             1844
FIELD_GOALS_ATTEMPTED        1844
FIELD_GOALS_PERCENTAGE       1844
THREE_POINTERS_MADE          1402
THREE_POINTERS_ATTEMPTED     1402
THREE_POINTERS_PERCENTAGE    1402
TWO_POINTERS_MADE            1844
TWO_POINTERS_ATTEMPTED       1844
TWO_POINTERS_PERCENTAGE      1844
FREE_THROWS_MADE             1844
FREE_THROWS_ATTEMPTED        1844
FREE_THROW_PERCENTAGE        1844
OFFENSIVE_REBOUND            1515
DEFENSIVE_REBOUND            1515
TOTAL_REBOUNDS               1792
ASSISTS                      1844
STEALS                       1456
BLOCKS                       1456
TURNOVERS                    1527
PERSONAL FOULS               1844
POINTS                       1844
W                            1757
L                            1757
dtype: int64
In [80]:
# Do a left merge on the dataframes on SEASON, LEAGUE, and TEAM columns
# This is done to assert all dataframes have a corresponding win and loss column
team_statPG_df = pd.merge(team_statPG_df, team_sum_df[['SEASON', 'LEAGUE', 'TEAM', 'W', 'L']], on=['SEASON', 'LEAGUE', 'TEAM'], how='left')

# Display the merged dataframe
print(team_statPG_df.notnull().sum())
SEASON                             1845
LEAGUE                             1845
TEAM                               1845
ABBREVIATION                       1758
PLAYOFFS                           1845
GAMES                              1844
mp_per_game                        1655
fg_per_game                        1844
FIELD_GOAL_AGAINST_PER_GAME        1844
FIELD_GOAL_PERCENTAGE              1844
THREE_POINTERS_PER_GAME            1402
THREE_POINTERS_AGAINST_PER_GAME    1402
THREE_POINTER_POSITION             1402
TWO_POINTERS_PER_GAME              1844
TWO_POINTER_AGAINST_PER_GAME       1844
TWO_POINTER_PERCENTAGE             1844
FREE_THROW_PER_GAME                1844
FREE_THROW_AGAINST_PER_GAME        1844
FREE_THROW_PERCENTAGE              1844
OFFENSIVE_Rebound_per_game         1515
DEFENSIVE_REBOUND_PER_GAME         1515
TOTAL_REBOUND_PER_GAME             1792
ASSIST_PER_GAME                    1844
STEALS_PER_GAME                    1456
BLOCKS_PER_GAME                    1456
TURNOVER_PER_GAME                  1527
PERSONAL_FOULS_PER_GAME            1844
POINTS_PER_GAME                    1844
W                                  1757
L                                  1757
dtype: int64

We'll now do a Heatmap to determine which variables are mostly related with the number of Wins versus Loss

In [81]:
# Calculate the correlation matrix
correlation_matrix = team_sum_df.corr(numeric_only=True)

# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
#plt.show()

# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)

print("Correlation with Wins:")
print(correlation_with_wins)

print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins:
W                                        1.000000
PW                                       0.968978
NET_RATING                               0.923874
MARGIN_OF_VICTORY                        0.917755
SIMPLE_RATING_SYSTEM                     0.915788
PLAYOFFS                                 0.718380
AGE                                      0.439181
OFFENSIVE_RATING                         0.358557
TRUE_SHOOTING_PERCENTAGE                 0.356146
EFFECTIEV_FIELD_GOAL_PERCENTAGE          0.322930
ATTEND                                   0.289586
ATTEND_G                                 0.203818
FREE_THROWS_PER_FIELD_GOAL_ATTEMPT       0.142542
OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE    0.093637
OFFENSIVE_REBOUND_PERCENTAGE             0.089972
FREE_THROW_Rate                          0.082298
OPPONENT_FREE_THROW_PERCENTAGE           0.077277
SEASON                                   0.035656
PACE                                     0.005594
THREE_POINTER_ATTEMPT_RATE              -0.008625
TURNOVER_PERCENTAGE                     -0.130810
OPPONENT_FREE_THROW_PERCENTAGE.1        -0.196387
DEFENSIVE_RATING                        -0.285715
OPPONENT_EFFECTIVE_FIELD_GAL            -0.396491
STRENGTH_OF_SCHEDULE                    -0.475727
PL                                      -0.773434
L                                       -0.814865
Name: W, dtype: float64

Correlation with Losses:
L                                        1.000000
PL                                       0.968569
STRENGTH_OF_SCHEDULE                     0.489078
DEFENSIVE_RATING                         0.331425
OPPONENT_EFFECTIVE_FIELD_GAL             0.312214
OPPONENT_FREE_THROW_PERCENTAGE.1         0.264494
TURNOVER_PERCENTAGE                      0.214612
SEASON                                   0.035872
PACE                                     0.031618
OPPONENT_FREE_THROW_PERCENTAGE           0.006004
OFFENSIVE_REBOUND_PERCENTAGE             0.000353
ATTEND                                  -0.092904
FREE_THROW_Rate                         -0.114637
EFFECTIEV_FIELD_GOAL_PERCENTAGE         -0.123563
THREE_POINTER_ATTEMPT_RATE              -0.131948
FREE_THROWS_PER_FIELD_GOAL_ATTEMPT      -0.146011
TRUE_SHOOTING_PERCENTAGE                -0.153538
ATTEND_G                                -0.167581
OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE   -0.193554
OFFENSIVE_RATING                        -0.310158
AGE                                     -0.444833
PLAYOFFS                                -0.692873
PW                                      -0.778654
W                                       -0.814865
SIMPLE_RATING_SYSTEM                    -0.908782
MARGIN_OF_VICTORY                       -0.912530
NET_RATING                              -0.919095
Name: L, dtype: float64
In [82]:
team_sum_df.drop(columns=['FREE_THROWS_PER_FIELD_GOAL_ATTEMPT', 'OPPONENT_DEFENSIVE_REBOUND_PERCENTAGE',
                          'OFFENSIVE_REBOUND_PERCENTAGE', 'FREE_THROW_Rate',
                          'OPPONENT_FREE_THROW_PERCENTAGE', 'PACE',
                          'ARENA', 'ATTEND', 'ATTEND_G',
                          'THREE_POINTER_ATTEMPT_RATE'], inplace=True)

print(team_sum_df.notnull().sum())
SEASON                              1845
LEAGUE                              1845
TEAM                                1845
ABBREVIATION                        1758
PLAYOFFS                            1845
AGE                                 1781
W                                   1757
L                                   1757
PW                                  1844
PL                                  1844
MARGIN_OF_VICTORY                   1844
STRENGTH_OF_SCHEDULE                1844
SIMPLE_RATING_SYSTEM                1844
OFFENSIVE_RATING                    1792
DEFENSIVE_RATING                    1792
NET_RATING                          1709
TRUE_SHOOTING_PERCENTAGE            1844
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1844
TURNOVER_PERCENTAGE                 1527
OPPONENT_EFFECTIVE_FIELD_GAL        1581
OPPONENT_FREE_THROW_PERCENTAGE.1    1581
dtype: int64
In [83]:
#Drop rows where columns have just a few missing value

team_sum_df.dropna(subset=['W', 'L'], inplace=True)
print(team_sum_df.notnull().sum())
SEASON                              1757
LEAGUE                              1757
TEAM                                1757
ABBREVIATION                        1757
PLAYOFFS                            1757
AGE                                 1698
W                                   1757
L                                   1757
PW                                  1757
PL                                  1757
MARGIN_OF_VICTORY                   1757
STRENGTH_OF_SCHEDULE                1757
SIMPLE_RATING_SYSTEM                1757
OFFENSIVE_RATING                    1709
DEFENSIVE_RATING                    1709
NET_RATING                          1709
TRUE_SHOOTING_PERCENTAGE            1757
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1757
TURNOVER_PERCENTAGE                 1467
OPPONENT_EFFECTIVE_FIELD_GAL        1518
OPPONENT_FREE_THROW_PERCENTAGE.1    1518
dtype: int64
In [84]:
#Impute values based on teams with the same name

imputer = SimpleImputer(strategy='mean')
# Define the columns to impute
columns_to_impute = ['AGE', 'DEFENSIVE_RATING', 'OFFENSIVE_RATING', 'NET_RATING',
                     'OPPONENT_EFFECTIVE_FIELD_GAL',
                     'OPPONENT_FREE_THROW_PERCENTAGE.1',
                     'TURNOVER_PERCENTAGE']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Iterate over each column to impute
for column in columns_to_impute:
    # Group by 'TEAM' and apply the imputer to each group
    team_sum_df[column] = team_sum_df.groupby('TEAM')[column].transform(
        lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
    )

# Print the count of non-null values after imputation
print(team_sum_df.notnull().sum())
SEASON                              1757
LEAGUE                              1757
TEAM                                1757
ABBREVIATION                        1757
PLAYOFFS                            1757
AGE                                 1736
W                                   1757
L                                   1757
PW                                  1757
PL                                  1757
MARGIN_OF_VICTORY                   1757
STRENGTH_OF_SCHEDULE                1757
SIMPLE_RATING_SYSTEM                1757
OFFENSIVE_RATING                    1738
DEFENSIVE_RATING                    1738
NET_RATING                          1738
TRUE_SHOOTING_PERCENTAGE            1757
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1757
TURNOVER_PERCENTAGE                 1608
OPPONENT_EFFECTIVE_FIELD_GAL        1653
OPPONENT_FREE_THROW_PERCENTAGE.1    1653
dtype: int64
In [85]:
#Drop rows where teams did not have any data to help with imputation

team_sum_df.dropna(subset=['AGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL',
                           'OPPONENT_FREE_THROW_PERCENTAGE.1'], inplace=True)
print(team_sum_df.notnull().sum())
SEASON                              1608
LEAGUE                              1608
TEAM                                1608
ABBREVIATION                        1608
PLAYOFFS                            1608
AGE                                 1608
W                                   1608
L                                   1608
PW                                  1608
PL                                  1608
MARGIN_OF_VICTORY                   1608
STRENGTH_OF_SCHEDULE                1608
SIMPLE_RATING_SYSTEM                1608
OFFENSIVE_RATING                    1608
DEFENSIVE_RATING                    1608
NET_RATING                          1608
TRUE_SHOOTING_PERCENTAGE            1608
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1608
TURNOVER_PERCENTAGE                 1608
OPPONENT_EFFECTIVE_FIELD_GAL        1608
OPPONENT_FREE_THROW_PERCENTAGE.1    1608
dtype: int64
In [86]:
print(team_totals_df.shape)
(1845, 30)
In [87]:
print(team_totals_df.notnull().sum())
SEASON                       1845
LEAGUE                       1845
TEAM                         1845
ABBREVIATION                 1758
PLAYOFFS                     1845
GAMES                        1844
MINUTES_PLAYED               1655
FIELD_GOALS_MADE             1844
FIELD_GOALS_ATTEMPTED        1844
FIELD_GOALS_PERCENTAGE       1844
THREE_POINTERS_MADE          1402
THREE_POINTERS_ATTEMPTED     1402
THREE_POINTERS_PERCENTAGE    1402
TWO_POINTERS_MADE            1844
TWO_POINTERS_ATTEMPTED       1844
TWO_POINTERS_PERCENTAGE      1844
FREE_THROWS_MADE             1844
FREE_THROWS_ATTEMPTED        1844
FREE_THROW_PERCENTAGE        1844
OFFENSIVE_REBOUND            1515
DEFENSIVE_REBOUND            1515
TOTAL_REBOUNDS               1792
ASSISTS                      1844
STEALS                       1456
BLOCKS                       1456
TURNOVERS                    1527
PERSONAL FOULS               1844
POINTS                       1844
W                            1757
L                            1757
dtype: int64
In [88]:
# Doing a Heatmap to determine which variables are mostly related with the number of Wins versus Loss, again

# Calculate the correlation matrix
correlation_matrix = team_totals_df.corr(numeric_only=True)

# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)

print("Correlation with Wins:")
print(correlation_with_wins)

print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins:
W                            1.000000
PLAYOFFS                     0.718380
DEFENSIVE_REBOUND            0.408673
ASSISTS                      0.407251
POINTS                       0.389685
FIELD_GOALS_PERCENTAGE       0.385399
FIELD_GOALS_MADE             0.353058
BLOCKS                       0.318962
TWO_POINTERS_PERCENTAGE      0.316198
GAMES                        0.313225
MINUTES_PLAYED               0.276975
STEALS                       0.264540
FREE_THROWS_MADE             0.255571
TWO_POINTERS_MADE            0.245274
FIELD_GOALS_ATTEMPTED        0.222260
FREE_THROWS_ATTEMPTED        0.210115
TOTAL_REBOUNDS               0.203584
FREE_THROW_PERCENTAGE        0.200805
THREE_POINTERS_PERCENTAGE    0.133557
TWO_POINTERS_ATTEMPTED       0.107317
OFFENSIVE_REBOUND            0.092751
PERSONAL FOULS               0.089372
THREE_POINTERS_MADE          0.066237
TURNOVERS                    0.052525
SEASON                       0.035656
THREE_POINTERS_ATTEMPTED     0.033770
L                           -0.814865
Name: W, dtype: float64

Correlation with Losses:
L                            1.000000
PERSONAL FOULS               0.305033
GAMES                        0.295246
TURNOVERS                    0.280257
MINUTES_PLAYED               0.259580
FIELD_GOALS_ATTEMPTED        0.238983
OFFENSIVE_REBOUND            0.197767
TWO_POINTERS_ATTEMPTED       0.170492
TWO_POINTERS_MADE            0.129778
FIELD_GOALS_MADE             0.126110
POINTS                       0.105312
FREE_THROWS_ATTEMPTED        0.068892
STEALS                       0.056192
FREE_THROWS_MADE             0.046885
SEASON                       0.035872
TOTAL_REBOUNDS               0.035398
ASSISTS                      0.019720
DEFENSIVE_REBOUND           -0.007304
BLOCKS                      -0.080132
THREE_POINTERS_ATTEMPTED    -0.081842
FIELD_GOALS_PERCENTAGE      -0.101884
FREE_THROW_PERCENTAGE       -0.107084
TWO_POINTERS_PERCENTAGE     -0.116543
THREE_POINTERS_MADE         -0.117717
THREE_POINTERS_PERCENTAGE   -0.200552
PLAYOFFS                    -0.692873
W                           -0.814865
Name: L, dtype: float64
In [89]:
#Drop Columns we might not use

team_totals_df.drop(columns=['THREE_POINTERS_PERCENTAGE', 'TWO_POINTERS_ATTEMPTED',
                             'OFFENSIVE_REBOUND', 'PERSONAL FOULS', 'THREE_POINTERS_MADE',
                             'THREE_POINTERS_ATTEMPTED',
                          ], inplace=True)

print(team_totals_df.notnull().sum())
SEASON                     1845
LEAGUE                     1845
TEAM                       1845
ABBREVIATION               1758
PLAYOFFS                   1845
GAMES                      1844
MINUTES_PLAYED             1655
FIELD_GOALS_MADE           1844
FIELD_GOALS_ATTEMPTED      1844
FIELD_GOALS_PERCENTAGE     1844
TWO_POINTERS_MADE          1844
TWO_POINTERS_PERCENTAGE    1844
FREE_THROWS_MADE           1844
FREE_THROWS_ATTEMPTED      1844
FREE_THROW_PERCENTAGE      1844
DEFENSIVE_REBOUND          1515
TOTAL_REBOUNDS             1792
ASSISTS                    1844
STEALS                     1456
BLOCKS                     1456
TURNOVERS                  1527
POINTS                     1844
W                          1757
L                          1757
dtype: int64
In [90]:
#Drop rows where columns have just a few missing value
team_totals_df.dropna(subset=['W', 'L'], inplace=True)

print(team_totals_df.notnull().sum())
SEASON                     1757
LEAGUE                     1757
TEAM                       1757
ABBREVIATION               1757
PLAYOFFS                   1757
GAMES                      1757
MINUTES_PLAYED             1586
FIELD_GOALS_MADE           1757
FIELD_GOALS_ATTEMPTED      1757
FIELD_GOALS_PERCENTAGE     1757
TWO_POINTERS_MADE          1757
TWO_POINTERS_PERCENTAGE    1757
FREE_THROWS_MADE           1757
FREE_THROWS_ATTEMPTED      1757
FREE_THROW_PERCENTAGE      1757
DEFENSIVE_REBOUND          1456
TOTAL_REBOUNDS             1709
ASSISTS                    1757
STEALS                     1402
BLOCKS                     1402
TURNOVERS                  1467
POINTS                     1757
W                          1757
L                          1757
dtype: int64
In [91]:
#Impute values based on teams with the same name, and in the same season

#Impute values based on teams with the same name
imputer = SimpleImputer(strategy='mean')

# Define the columns to impute
columns_to_impute = ['MINUTES_PLAYED', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'STEALS', 'BLOCKS', 'TURNOVERS']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Iterate over each column to impute
for column in columns_to_impute:
    # Group by 'TEAM' and apply the imputer to each group
    team_totals_df[column] = team_totals_df.groupby('TEAM')[column].transform(
        lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
    )

# Print the count of non-null values after imputation
print(team_totals_df.notnull().sum())
SEASON                     1757
LEAGUE                     1757
TEAM                       1757
ABBREVIATION               1757
PLAYOFFS                   1757
GAMES                      1757
MINUTES_PLAYED             1666
FIELD_GOALS_MADE           1757
FIELD_GOALS_ATTEMPTED      1757
FIELD_GOALS_PERCENTAGE     1757
TWO_POINTERS_MADE          1757
TWO_POINTERS_PERCENTAGE    1757
FREE_THROWS_MADE           1757
FREE_THROWS_ATTEMPTED      1757
FREE_THROW_PERCENTAGE      1757
DEFENSIVE_REBOUND          1605
TOTAL_REBOUNDS             1738
ASSISTS                    1757
STEALS                     1573
BLOCKS                     1573
TURNOVERS                  1608
POINTS                     1757
W                          1757
L                          1757
dtype: int64
In [92]:
#Drop rows where teams did not have any data to help with imputation

team_totals_df.dropna(subset=['MINUTES_PLAYED', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS'], inplace=True)
print(team_totals_df.notnull().sum())
SEASON                     1605
LEAGUE                     1605
TEAM                       1605
ABBREVIATION               1605
PLAYOFFS                   1605
GAMES                      1605
MINUTES_PLAYED             1605
FIELD_GOALS_MADE           1605
FIELD_GOALS_ATTEMPTED      1605
FIELD_GOALS_PERCENTAGE     1605
TWO_POINTERS_MADE          1605
TWO_POINTERS_PERCENTAGE    1605
FREE_THROWS_MADE           1605
FREE_THROWS_ATTEMPTED      1605
FREE_THROW_PERCENTAGE      1605
DEFENSIVE_REBOUND          1605
TOTAL_REBOUNDS             1605
ASSISTS                    1605
STEALS                     1573
BLOCKS                     1573
TURNOVERS                  1605
POINTS                     1605
W                          1605
L                          1605
dtype: int64
In [93]:
# Fill null values in 'STEALS' and 'BLOCKS' columns with zero
team_totals_df['STEALS'].fillna(0, inplace=True)
team_totals_df['BLOCKS'].fillna(0, inplace=True)

# Check the count of non-null values after filling null values
print(team_totals_df.notnull().sum())
SEASON                     1605
LEAGUE                     1605
TEAM                       1605
ABBREVIATION               1605
PLAYOFFS                   1605
GAMES                      1605
MINUTES_PLAYED             1605
FIELD_GOALS_MADE           1605
FIELD_GOALS_ATTEMPTED      1605
FIELD_GOALS_PERCENTAGE     1605
TWO_POINTERS_MADE          1605
TWO_POINTERS_PERCENTAGE    1605
FREE_THROWS_MADE           1605
FREE_THROWS_ATTEMPTED      1605
FREE_THROW_PERCENTAGE      1605
DEFENSIVE_REBOUND          1605
TOTAL_REBOUNDS             1605
ASSISTS                    1605
STEALS                     1605
BLOCKS                     1605
TURNOVERS                  1605
POINTS                     1605
W                          1605
L                          1605
dtype: int64
In [94]:
#Dealing with the team_Stats_Per_Game
print(team_statPG_df.shape)
(1845, 30)
In [95]:
print(team_statPG_df.notnull().sum())
SEASON                             1845
LEAGUE                             1845
TEAM                               1845
ABBREVIATION                       1758
PLAYOFFS                           1845
GAMES                              1844
mp_per_game                        1655
fg_per_game                        1844
FIELD_GOAL_AGAINST_PER_GAME        1844
FIELD_GOAL_PERCENTAGE              1844
THREE_POINTERS_PER_GAME            1402
THREE_POINTERS_AGAINST_PER_GAME    1402
THREE_POINTER_POSITION             1402
TWO_POINTERS_PER_GAME              1844
TWO_POINTER_AGAINST_PER_GAME       1844
TWO_POINTER_PERCENTAGE             1844
FREE_THROW_PER_GAME                1844
FREE_THROW_AGAINST_PER_GAME        1844
FREE_THROW_PERCENTAGE              1844
OFFENSIVE_Rebound_per_game         1515
DEFENSIVE_REBOUND_PER_GAME         1515
TOTAL_REBOUND_PER_GAME             1792
ASSIST_PER_GAME                    1844
STEALS_PER_GAME                    1456
BLOCKS_PER_GAME                    1456
TURNOVER_PER_GAME                  1527
PERSONAL_FOULS_PER_GAME            1844
POINTS_PER_GAME                    1844
W                                  1757
L                                  1757
dtype: int64
In [96]:
# Doing a Heatmap to determine which variables are mostly related with the number of
# Wins versus teh number of Loss


# Calculate the correlation matrix
correlation_matrix = team_statPG_df.corr(numeric_only=True)

# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
#plt.show()

# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)

print("Correlation with Wins:")
print(correlation_with_wins)

print("\nCorrelation with Losses:")
print(correlation_with_losses)
Correlation with Wins:
W                                  1.000000
PLAYOFFS                           0.718380
FIELD_GOAL_PERCENTAGE              0.385399
ASSIST_PER_GAME                    0.316808
TWO_POINTER_PERCENTAGE             0.316198
GAMES                              0.313225
POINTS_PER_GAME                    0.303466
fg_per_game                        0.268725
DEFENSIVE_REBOUND_PER_GAME         0.265481
BLOCKS_PER_GAME                    0.218514
FREE_THROW_PERCENTAGE              0.200805
TWO_POINTERS_PER_GAME              0.157563
STEALS_PER_GAME                    0.147886
THREE_POINTER_POSITION             0.133557
FREE_THROW_PER_GAME                0.119712
FREE_THROW_AGAINST_PER_GAME        0.066455
TOTAL_REBOUND_PER_GAME             0.066041
SEASON                             0.035656
THREE_POINTERS_PER_GAME            0.015291
mp_per_game                        0.000792
FIELD_GOAL_AGAINST_PER_GAME       -0.013688
THREE_POINTERS_AGAINST_PER_GAME   -0.018949
TWO_POINTER_AGAINST_PER_GAME      -0.019466
OFFENSIVE_Rebound_per_game        -0.021232
TURNOVER_PER_GAME                 -0.096362
PERSONAL_FOULS_PER_GAME           -0.132005
L                                 -0.814865
Name: W, dtype: float64

Correlation with Losses:
L                                  1.000000
GAMES                              0.295246
TURNOVER_PER_GAME                  0.194289
PERSONAL_FOULS_PER_GAME            0.157826
OFFENSIVE_Rebound_per_game         0.108481
TWO_POINTER_AGAINST_PER_GAME       0.057241
SEASON                             0.035872
FIELD_GOAL_AGAINST_PER_GAME        0.031538
TWO_POINTERS_PER_GAME              0.023603
mp_per_game                       -0.012973
fg_per_game                       -0.045643
FREE_THROW_AGAINST_PER_GAME       -0.079196
STEALS_PER_GAME                   -0.093570
FIELD_GOAL_PERCENTAGE             -0.101884
TOTAL_REBOUND_PER_GAME            -0.105894
FREE_THROW_PER_GAME               -0.107001
FREE_THROW_PERCENTAGE             -0.107084
POINTS_PER_GAME                   -0.115425
TWO_POINTER_PERCENTAGE            -0.116543
THREE_POINTERS_AGAINST_PER_GAME   -0.124966
ASSIST_PER_GAME                   -0.150416
THREE_POINTERS_PER_GAME           -0.157817
THREE_POINTER_POSITION            -0.200552
BLOCKS_PER_GAME                   -0.210151
DEFENSIVE_REBOUND_PER_GAME        -0.308427
PLAYOFFS                          -0.692873
W                                 -0.814865
Name: L, dtype: float64
In [97]:
team_statPG_df.dropna(subset=['W', 'L'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON                             1757
LEAGUE                             1757
TEAM                               1757
ABBREVIATION                       1757
PLAYOFFS                           1757
GAMES                              1757
mp_per_game                        1586
fg_per_game                        1757
FIELD_GOAL_AGAINST_PER_GAME        1757
FIELD_GOAL_PERCENTAGE              1757
THREE_POINTERS_PER_GAME            1348
THREE_POINTERS_AGAINST_PER_GAME    1348
THREE_POINTER_POSITION             1348
TWO_POINTERS_PER_GAME              1757
TWO_POINTER_AGAINST_PER_GAME       1757
TWO_POINTER_PERCENTAGE             1757
FREE_THROW_PER_GAME                1757
FREE_THROW_AGAINST_PER_GAME        1757
FREE_THROW_PERCENTAGE              1757
OFFENSIVE_Rebound_per_game         1456
DEFENSIVE_REBOUND_PER_GAME         1456
TOTAL_REBOUND_PER_GAME             1709
ASSIST_PER_GAME                    1757
STEALS_PER_GAME                    1402
BLOCKS_PER_GAME                    1402
TURNOVER_PER_GAME                  1467
PERSONAL_FOULS_PER_GAME            1757
POINTS_PER_GAME                    1757
W                                  1757
L                                  1757
dtype: int64
In [98]:
#Drop columns with low correlation
team_statPG_df.drop(columns=['TWO_POINTERS_PER_GAME', 'STEALS_PER_GAME',
                          'THREE_POINTER_POSITION', 'FREE_THROW_PER_GAME', 'FREE_THROW_AGAINST_PER_GAME',
                          'TOTAL_REBOUND_PER_GAME', 'THREE_POINTERS_PER_GAME',
                          'mp_per_game', 'FIELD_GOAL_AGAINST_PER_GAME', 'THREE_POINTERS_AGAINST_PER_GAME',
                          'TWO_POINTER_AGAINST_PER_GAME', 'OFFENSIVE_Rebound_per_game', 'TURNOVER_PER_GAME',
                          'PERSONAL_FOULS_PER_GAME'], inplace=True)

print(team_statPG_df.notnull().sum())
SEASON                        1757
LEAGUE                        1757
TEAM                          1757
ABBREVIATION                  1757
PLAYOFFS                      1757
GAMES                         1757
fg_per_game                   1757
FIELD_GOAL_PERCENTAGE         1757
TWO_POINTER_PERCENTAGE        1757
FREE_THROW_PERCENTAGE         1757
DEFENSIVE_REBOUND_PER_GAME    1456
ASSIST_PER_GAME               1757
BLOCKS_PER_GAME               1402
POINTS_PER_GAME               1757
W                             1757
L                             1757
dtype: int64

Here we are imputing using the mean not of teh entire dataframe but for each missing value, we impute with the mean of the corresponding team name.

In [99]:
#Impute values based on teams with the same name
imputer = SimpleImputer(strategy='mean')

# Define the columns to impute
columns_to_impute = ['DEFENSIVE_REBOUND_PER_GAME', 'BLOCKS_PER_GAME']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Iterate over each column to impute
for column in columns_to_impute:
    # Group by 'TEAM' and apply the imputer to each group
    team_statPG_df[column] = team_statPG_df.groupby('TEAM')[column].transform(
        lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
    )

# Print the count of non-null values after imputation
print(team_statPG_df.notnull().sum())
SEASON                        1757
LEAGUE                        1757
TEAM                          1757
ABBREVIATION                  1757
PLAYOFFS                      1757
GAMES                         1757
fg_per_game                   1757
FIELD_GOAL_PERCENTAGE         1757
TWO_POINTER_PERCENTAGE        1757
FREE_THROW_PERCENTAGE         1757
DEFENSIVE_REBOUND_PER_GAME    1605
ASSIST_PER_GAME               1757
BLOCKS_PER_GAME               1573
POINTS_PER_GAME               1757
W                             1757
L                             1757
dtype: int64
In [100]:
#Drop rows where teams did not have any data to help with imputation

team_statPG_df.dropna(subset=['DEFENSIVE_REBOUND_PER_GAME'], inplace=True)
print(team_statPG_df.notnull().sum())
SEASON                        1605
LEAGUE                        1605
TEAM                          1605
ABBREVIATION                  1605
PLAYOFFS                      1605
GAMES                         1605
fg_per_game                   1605
FIELD_GOAL_PERCENTAGE         1605
TWO_POINTER_PERCENTAGE        1605
FREE_THROW_PERCENTAGE         1605
DEFENSIVE_REBOUND_PER_GAME    1605
ASSIST_PER_GAME               1605
BLOCKS_PER_GAME               1573
POINTS_PER_GAME               1605
W                             1605
L                             1605
dtype: int64
In [101]:
#Drop Columns Blocks_per_game as we already have Blocks in the team_summary df

team_statPG_df.drop(columns=['BLOCKS_PER_GAME'], inplace=True)

print(team_statPG_df.notnull().sum())
SEASON                        1605
LEAGUE                        1605
TEAM                          1605
ABBREVIATION                  1605
PLAYOFFS                      1605
GAMES                         1605
fg_per_game                   1605
FIELD_GOAL_PERCENTAGE         1605
TWO_POINTER_PERCENTAGE        1605
FREE_THROW_PERCENTAGE         1605
DEFENSIVE_REBOUND_PER_GAME    1605
ASSIST_PER_GAME               1605
POINTS_PER_GAME               1605
W                             1605
L                             1605
dtype: int64

Merging the teams CSVs based on the year and teh team name (team_id is unfortunately not on all CSVs)

In [102]:
# Initial merge (full outer join) between teams_df and team_totals_df
merged_team_df = pd.merge(team_totals_df, team_sum_df, on=['TEAM', 'SEASON'], how='outer')
merged_team_df = pd.merge(merged_team_df, team_statPG_df, on=['TEAM', 'SEASON'], how='outer')
merged_team_df.head(10)
Out[102]:
SEASON LEAGUE_x TEAM ABBREVIATION_x PLAYOFFS_x GAMES_x MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_ATTEMPTED FIELD_GOALS_PERCENTAGE ... GAMES_y fg_per_game FIELD_GOAL_PERCENTAGE TWO_POINTER_PERCENTAGE FREE_THROW_PERCENTAGE_y DEFENSIVE_REBOUND_PER_GAME ASSIST_PER_GAME POINTS_PER_GAME W L
0 2024 NBA Atlanta Hawks ATL False 55.0 13300.0 2397.0 5144.0 0.466 ... 55.0 43.6 0.466 0.539 0.813 31.9 26.2 121.3 24.0 31.0
1 2024 NBA Boston Celtics BOS False 55.0 13325.0 2391.0 4967.0 0.481 ... 55.0 43.5 0.481 0.572 0.808 36.6 26.2 120.7 43.0 12.0
2 2024 NBA Brooklyn Nets BRK False 54.0 13035.0 2257.0 4902.0 0.460 ... 54.0 41.8 0.460 0.525 0.759 32.8 26.8 113.4 21.0 33.0
3 2024 NBA Chicago Bulls CHI False 55.0 13400.0 2292.0 4904.0 0.467 ... 55.0 41.7 0.467 0.529 0.791 32.4 24.3 111.8 26.0 29.0
4 2024 NBA Charlotte Hornets CHO False 54.0 13010.0 2206.0 4790.0 0.461 ... 54.0 40.9 0.461 0.519 0.788 31.0 24.9 108.6 13.0 41.0
5 2024 NBA Cleveland Cavaliers CLE False 53.0 12770.0 2267.0 4698.0 0.483 ... 53.0 42.8 0.483 0.570 0.772 34.2 27.3 114.9 36.0 17.0
6 2024 NBA Dallas Mavericks DAL False 55.0 13200.0 2362.0 4937.0 0.478 ... 55.0 42.9 0.478 0.565 0.755 32.3 25.6 118.7 32.0 23.0
7 2024 NBA Denver Nuggets DEN False 55.0 13200.0 2388.0 4883.0 0.489 ... 55.0 43.4 0.489 0.554 0.756 33.0 28.6 113.9 36.0 19.0
8 2024 NBA Detroit Pistons DET False 54.0 13035.0 2281.0 4836.0 0.472 ... 54.0 42.2 0.472 0.531 0.778 32.4 26.5 112.9 8.0 46.0
9 2024 NBA Golden State Warriors GSW False 53.0 12870.0 2311.0 4871.0 0.474 ... 53.0 43.6 0.474 0.546 0.787 34.3 29.0 119.7 27.0 26.0

10 rows × 56 columns

In [103]:
#Just sanity checking
print(team_totals_df.shape)
print(team_sum_df.shape)
print(team_statPG_df.shape)
print(merged_team_df.shape)
(1605, 24)
(1608, 21)
(1605, 15)
(1608, 56)
In [104]:
merged_team_columns = merged_team_df.columns.tolist()
print(merged_team_columns)
['SEASON', 'LEAGUE_x', 'TEAM', 'ABBREVIATION_x', 'PLAYOFFS_x', 'GAMES_x', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOALS_ATTEMPTED', 'FIELD_GOALS_PERCENTAGE', 'TWO_POINTERS_MADE', 'TWO_POINTERS_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROWS_ATTEMPTED', 'FREE_THROW_PERCENTAGE_x', 'DEFENSIVE_REBOUND', 'TOTAL_REBOUNDS', 'ASSISTS', 'STEALS', 'BLOCKS', 'TURNOVERS', 'POINTS', 'W_x', 'L_x', 'LEAGUE_y', 'ABBREVIATION_y', 'PLAYOFFS_y', 'AGE', 'W_y', 'L_y', 'PW', 'PL', 'MARGIN_OF_VICTORY', 'STRENGTH_OF_SCHEDULE', 'SIMPLE_RATING_SYSTEM', 'OFFENSIVE_RATING', 'DEFENSIVE_RATING', 'NET_RATING', 'TRUE_SHOOTING_PERCENTAGE', 'EFFECTIEV_FIELD_GOAL_PERCENTAGE', 'TURNOVER_PERCENTAGE', 'OPPONENT_EFFECTIVE_FIELD_GAL', 'OPPONENT_FREE_THROW_PERCENTAGE.1', 'LEAGUE', 'ABBREVIATION', 'PLAYOFFS', 'GAMES_y', 'fg_per_game', 'FIELD_GOAL_PERCENTAGE', 'TWO_POINTER_PERCENTAGE', 'FREE_THROW_PERCENTAGE_y', 'DEFENSIVE_REBOUND_PER_GAME', 'ASSIST_PER_GAME', 'POINTS_PER_GAME', 'W', 'L']
In [105]:
print(merged_team_df.notnull().sum())
SEASON                              1608
LEAGUE_x                            1605
TEAM                                1608
ABBREVIATION_x                      1605
PLAYOFFS_x                          1605
GAMES_x                             1605
MINUTES_PLAYED                      1605
FIELD_GOALS_MADE                    1605
FIELD_GOALS_ATTEMPTED               1605
FIELD_GOALS_PERCENTAGE              1605
TWO_POINTERS_MADE                   1605
TWO_POINTERS_PERCENTAGE             1605
FREE_THROWS_MADE                    1605
FREE_THROWS_ATTEMPTED               1605
FREE_THROW_PERCENTAGE_x             1605
DEFENSIVE_REBOUND                   1605
TOTAL_REBOUNDS                      1605
ASSISTS                             1605
STEALS                              1605
BLOCKS                              1605
TURNOVERS                           1605
POINTS                              1605
W_x                                 1605
L_x                                 1605
LEAGUE_y                            1608
ABBREVIATION_y                      1608
PLAYOFFS_y                          1608
AGE                                 1608
W_y                                 1608
L_y                                 1608
PW                                  1608
PL                                  1608
MARGIN_OF_VICTORY                   1608
STRENGTH_OF_SCHEDULE                1608
SIMPLE_RATING_SYSTEM                1608
OFFENSIVE_RATING                    1608
DEFENSIVE_RATING                    1608
NET_RATING                          1608
TRUE_SHOOTING_PERCENTAGE            1608
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1608
TURNOVER_PERCENTAGE                 1608
OPPONENT_EFFECTIVE_FIELD_GAL        1608
OPPONENT_FREE_THROW_PERCENTAGE.1    1608
LEAGUE                              1605
ABBREVIATION                        1605
PLAYOFFS                            1605
GAMES_y                             1605
fg_per_game                         1605
FIELD_GOAL_PERCENTAGE               1605
TWO_POINTER_PERCENTAGE              1605
FREE_THROW_PERCENTAGE_y             1605
DEFENSIVE_REBOUND_PER_GAME          1605
ASSIST_PER_GAME                     1605
POINTS_PER_GAME                     1605
W                                   1605
L                                   1605
dtype: int64
In [106]:
merged_team_df.dropna(subset=['W_x', 'L_x'], inplace=True)
print(merged_team_df.notnull().sum())
SEASON                              1605
LEAGUE_x                            1605
TEAM                                1605
ABBREVIATION_x                      1605
PLAYOFFS_x                          1605
GAMES_x                             1605
MINUTES_PLAYED                      1605
FIELD_GOALS_MADE                    1605
FIELD_GOALS_ATTEMPTED               1605
FIELD_GOALS_PERCENTAGE              1605
TWO_POINTERS_MADE                   1605
TWO_POINTERS_PERCENTAGE             1605
FREE_THROWS_MADE                    1605
FREE_THROWS_ATTEMPTED               1605
FREE_THROW_PERCENTAGE_x             1605
DEFENSIVE_REBOUND                   1605
TOTAL_REBOUNDS                      1605
ASSISTS                             1605
STEALS                              1605
BLOCKS                              1605
TURNOVERS                           1605
POINTS                              1605
W_x                                 1605
L_x                                 1605
LEAGUE_y                            1605
ABBREVIATION_y                      1605
PLAYOFFS_y                          1605
AGE                                 1605
W_y                                 1605
L_y                                 1605
PW                                  1605
PL                                  1605
MARGIN_OF_VICTORY                   1605
STRENGTH_OF_SCHEDULE                1605
SIMPLE_RATING_SYSTEM                1605
OFFENSIVE_RATING                    1605
DEFENSIVE_RATING                    1605
NET_RATING                          1605
TRUE_SHOOTING_PERCENTAGE            1605
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1605
TURNOVER_PERCENTAGE                 1605
OPPONENT_EFFECTIVE_FIELD_GAL        1605
OPPONENT_FREE_THROW_PERCENTAGE.1    1605
LEAGUE                              1605
ABBREVIATION                        1605
PLAYOFFS                            1605
GAMES_y                             1605
fg_per_game                         1605
FIELD_GOAL_PERCENTAGE               1605
TWO_POINTER_PERCENTAGE              1605
FREE_THROW_PERCENTAGE_y             1605
DEFENSIVE_REBOUND_PER_GAME          1605
ASSIST_PER_GAME                     1605
POINTS_PER_GAME                     1605
W                                   1605
L                                   1605
dtype: int64
In [107]:
#Drop repeating columns
merged_team_df.drop(columns=['W_x', 'W_x',
                          'W_y', 'L_y', 'FREE_THROW_PERCENTAGE_y', 'GAMES_y', 'LEAGUE_y',
                             'ABBREVIATION_y'], inplace=True)

print(merged_team_df.notnull().sum())
SEASON                              1605
LEAGUE_x                            1605
TEAM                                1605
ABBREVIATION_x                      1605
PLAYOFFS_x                          1605
GAMES_x                             1605
MINUTES_PLAYED                      1605
FIELD_GOALS_MADE                    1605
FIELD_GOALS_ATTEMPTED               1605
FIELD_GOALS_PERCENTAGE              1605
TWO_POINTERS_MADE                   1605
TWO_POINTERS_PERCENTAGE             1605
FREE_THROWS_MADE                    1605
FREE_THROWS_ATTEMPTED               1605
FREE_THROW_PERCENTAGE_x             1605
DEFENSIVE_REBOUND                   1605
TOTAL_REBOUNDS                      1605
ASSISTS                             1605
STEALS                              1605
BLOCKS                              1605
TURNOVERS                           1605
POINTS                              1605
L_x                                 1605
PLAYOFFS_y                          1605
AGE                                 1605
PW                                  1605
PL                                  1605
MARGIN_OF_VICTORY                   1605
STRENGTH_OF_SCHEDULE                1605
SIMPLE_RATING_SYSTEM                1605
OFFENSIVE_RATING                    1605
DEFENSIVE_RATING                    1605
NET_RATING                          1605
TRUE_SHOOTING_PERCENTAGE            1605
EFFECTIEV_FIELD_GOAL_PERCENTAGE     1605
TURNOVER_PERCENTAGE                 1605
OPPONENT_EFFECTIVE_FIELD_GAL        1605
OPPONENT_FREE_THROW_PERCENTAGE.1    1605
LEAGUE                              1605
ABBREVIATION                        1605
PLAYOFFS                            1605
fg_per_game                         1605
FIELD_GOAL_PERCENTAGE               1605
TWO_POINTER_PERCENTAGE              1605
DEFENSIVE_REBOUND_PER_GAME          1605
ASSIST_PER_GAME                     1605
POINTS_PER_GAME                     1605
W                                   1605
L                                   1605
dtype: int64
In [108]:
# Doing a Heatmap to determine which variables are mostly related with the number of
# Wins versus teh number of Loss

# Calculate the correlation matrix
correlation_matrix = merged_team_df.corr(numeric_only=True)


# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


# Sorting correlations with respect to 'W' (Wins) and 'L' (Losses)
correlation_with_wins = correlation_matrix['W'].sort_values(ascending=False)
correlation_with_losses = correlation_matrix['L'].sort_values(ascending=False)


print("Correlation with Wins:")
print(correlation_with_wins)


print("\nCorrelation with Losses:")
print(correlation_with_losses)

#Might still wanna drop TURNOVER_PERCENTAGE and FREE_THROW_PERCENTAGE_x
Correlation with Wins:
W                                   1.000000
PW                                  0.970126
MARGIN_OF_VICTORY                   0.925597
SIMPLE_RATING_SYSTEM                0.923119
NET_RATING                          0.919308
PLAYOFFS_y                          0.739580
AGE                                 0.439641
FIELD_GOAL_PERCENTAGE               0.404188
FIELD_GOALS_PERCENTAGE              0.404188
ASSISTS                             0.401628
POINTS                              0.388451
DEFENSIVE_REBOUND                   0.380249
OFFENSIVE_RATING                    0.371771
TRUE_SHOOTING_PERCENTAGE            0.353290
FIELD_GOALS_MADE                    0.343076
EFFECTIEV_FIELD_GOAL_PERCENTAGE     0.311990
ASSIST_PER_GAME                     0.300734
TWO_POINTER_PERCENTAGE              0.299908
TWO_POINTERS_PERCENTAGE             0.299908
GAMES_x                             0.289034
POINTS_PER_GAME                     0.284400
MINUTES_PLAYED                      0.273111
FREE_THROWS_MADE                    0.271898
BLOCKS                              0.264094
TOTAL_REBOUNDS                      0.259601
DEFENSIVE_REBOUND_PER_GAME          0.246202
fg_per_game                         0.241840
FREE_THROWS_ATTEMPTED               0.235512
TWO_POINTERS_MADE                   0.224873
FIELD_GOALS_ATTEMPTED               0.209035
STEALS                              0.207923
FREE_THROW_PERCENTAGE_x             0.159023
TURNOVERS                           0.037907
SEASON                             -0.041475
TURNOVER_PERCENTAGE                -0.132646
OPPONENT_FREE_THROW_PERCENTAGE.1   -0.193867
OPPONENT_EFFECTIVE_FIELD_GAL       -0.387089
DEFENSIVE_RATING                   -0.392106
STRENGTH_OF_SCHEDULE               -0.461399
PL                                 -0.800330
L_x                                -0.839122
L                                  -0.839122
Name: W, dtype: float64

Correlation with Losses:
L                                   1.000000
L_x                                 1.000000
PL                                  0.969897
STRENGTH_OF_SCHEDULE                0.475642
DEFENSIVE_RATING                    0.375201
OPPONENT_EFFECTIVE_FIELD_GAL        0.316313
GAMES_x                             0.278192
MINUTES_PLAYED                      0.253258
OPPONENT_FREE_THROW_PERCENTAGE.1    0.253139
TURNOVERS                           0.252481
FIELD_GOALS_ATTEMPTED               0.224683
TURNOVER_PERCENTAGE                 0.194887
TWO_POINTERS_MADE                   0.103275
FIELD_GOALS_MADE                    0.082420
FREE_THROWS_ATTEMPTED               0.063406
POINTS                              0.055874
STEALS                              0.039067
TOTAL_REBOUNDS                      0.037651
FREE_THROWS_MADE                    0.036751
SEASON                             -0.012133
DEFENSIVE_REBOUND                  -0.021783
ASSISTS                            -0.036554
BLOCKS                             -0.061461
fg_per_game                        -0.122568
FREE_THROW_PERCENTAGE_x            -0.141567
POINTS_PER_GAME                    -0.208133
ASSIST_PER_GAME                    -0.227411
FIELD_GOAL_PERCENTAGE              -0.247226
FIELD_GOALS_PERCENTAGE             -0.247226
TWO_POINTER_PERCENTAGE             -0.251151
TWO_POINTERS_PERCENTAGE            -0.251151
EFFECTIEV_FIELD_GOAL_PERCENTAGE    -0.264743
TRUE_SHOOTING_PERCENTAGE           -0.299309
DEFENSIVE_REBOUND_PER_GAME         -0.303271
OFFENSIVE_RATING                   -0.384010
AGE                                -0.448899
PLAYOFFS_y                         -0.692145
PW                                 -0.803704
W                                  -0.839122
NET_RATING                         -0.913882
SIMPLE_RATING_SYSTEM               -0.915752
MARGIN_OF_VICTORY                  -0.919776
Name: L, dtype: float64
In [109]:
merged_team_df.columns.tolist()
Out[109]:
['SEASON',
 'LEAGUE_x',
 'TEAM',
 'ABBREVIATION_x',
 'PLAYOFFS_x',
 'GAMES_x',
 'MINUTES_PLAYED',
 'FIELD_GOALS_MADE',
 'FIELD_GOALS_ATTEMPTED',
 'FIELD_GOALS_PERCENTAGE',
 'TWO_POINTERS_MADE',
 'TWO_POINTERS_PERCENTAGE',
 'FREE_THROWS_MADE',
 'FREE_THROWS_ATTEMPTED',
 'FREE_THROW_PERCENTAGE_x',
 'DEFENSIVE_REBOUND',
 'TOTAL_REBOUNDS',
 'ASSISTS',
 'STEALS',
 'BLOCKS',
 'TURNOVERS',
 'POINTS',
 'L_x',
 'PLAYOFFS_y',
 'AGE',
 'PW',
 'PL',
 'MARGIN_OF_VICTORY',
 'STRENGTH_OF_SCHEDULE',
 'SIMPLE_RATING_SYSTEM',
 'OFFENSIVE_RATING',
 'DEFENSIVE_RATING',
 'NET_RATING',
 'TRUE_SHOOTING_PERCENTAGE',
 'EFFECTIEV_FIELD_GOAL_PERCENTAGE',
 'TURNOVER_PERCENTAGE',
 'OPPONENT_EFFECTIVE_FIELD_GAL',
 'OPPONENT_FREE_THROW_PERCENTAGE.1',
 'LEAGUE',
 'ABBREVIATION',
 'PLAYOFFS',
 'fg_per_game',
 'FIELD_GOAL_PERCENTAGE',
 'TWO_POINTER_PERCENTAGE',
 'DEFENSIVE_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'POINTS_PER_GAME',
 'W',
 'L']
In [110]:
#Working with the games CSVs

game_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/game.csv")
games_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games.csv")
# upon examining the heatmap of teh relatioon between this datasets' columns and
# the team wins is very low. We will thus not be using it.
games_details_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games_details.csv")
<ipython-input-110-31b392da8928>:7: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  games_details_df = pd.read_csv("drive/MyDrive/FINAL PROJECT CSVs/games_details.csv")
In [111]:
print(games_details_df.dtypes)
GAME_ID                        int64
TEAM_ID                        int64
TEAM_ABBREVIATION             object
TEAM_CITY                     object
PLAYER_ID                      int64
PLAYER_NAME                   object
NICKNAME                      object
START_POSITION                object
COMMENT                       object
MINUTES                       object
FIELD_GOALS_MADE             float64
FIELD_GOALS_ATTEMPTED        float64
FIELD_GOAL_PERCENTAGE        float64
THREE_POINTERS_MADE          float64
THREE_POINTERS_ATTEMPTED     float64
THREE_POINTERS_PERCENTAGE    float64
FREE_THROWS_MADE             float64
FREE_THROWS_ATTEMPTED        float64
FREE_THROWS_PERCENTAGE       float64
OFFENSIVE_REBOUND            float64
DEFENSIVE_REBOUND            float64
TOTAL_REBOUNDS               float64
ASSITS                       float64
STEALS                       float64
BLOCKS                       float64
TURNOVERS                    float64
PERSONAL_FOULS               float64
TOTAL_POINTS                 float64
PLUS_MINUS                   float64
dtype: object
In [112]:
#There are irregularities in the minutes format.

def convert_to_mmss(time_str):
    # Check if the value is NaN
    if pd.isna(time_str):
        return None  # Return None for missing values

    # Split time string into hours, minutes, and seconds
    parts = time_str.split(':')

    # Check if the parts list has only one element (minutes)
    if len(parts) == 1:
        minutes = parts[0]
        seconds = '00'  # Set seconds to '00'
    elif len(parts) >= 2:
        minutes = parts[-2]
        seconds = parts[-1]
    else:
        return None  # Return None if the format is not as expected

    # Concatenate minutes and seconds with a colon separator
    mm_ss_format = f"{minutes}:{seconds}"

    return mm_ss_format

# Apply the function to the 'MINUTES' column
games_details_df['MINUTES'] = games_details_df['MINUTES'].apply(convert_to_mmss)

print(games_details_df.dtypes)
GAME_ID                        int64
TEAM_ID                        int64
TEAM_ABBREVIATION             object
TEAM_CITY                     object
PLAYER_ID                      int64
PLAYER_NAME                   object
NICKNAME                      object
START_POSITION                object
COMMENT                       object
MINUTES                       object
FIELD_GOALS_MADE             float64
FIELD_GOALS_ATTEMPTED        float64
FIELD_GOAL_PERCENTAGE        float64
THREE_POINTERS_MADE          float64
THREE_POINTERS_ATTEMPTED     float64
THREE_POINTERS_PERCENTAGE    float64
FREE_THROWS_MADE             float64
FREE_THROWS_ATTEMPTED        float64
FREE_THROWS_PERCENTAGE       float64
OFFENSIVE_REBOUND            float64
DEFENSIVE_REBOUND            float64
TOTAL_REBOUNDS               float64
ASSITS                       float64
STEALS                       float64
BLOCKS                       float64
TURNOVERS                    float64
PERSONAL_FOULS               float64
TOTAL_POINTS                 float64
PLUS_MINUS                   float64
dtype: object
In [113]:
print(games_df.dtypes)
GAME_DATE_EST                      object
GAME_ID                             int64
GAME_STATUS_TEXT                   object
HOME_TEAM_ID                        int64
VISITOR_TEAM_ID                     int64
SEASON                              int64
TEAM_ID_HOME                        int64
POINTS_HOME                       float64
FIELD_GOALS_PERCENTAGE_HOME       float64
FREE_THROWS_PERCENTAGE_HOME       float64
THREE_POINTERS_PERCENTAGE_HOME    float64
ASSISTS_HOME                      float64
TOTAL_REBOUNDS_HOME               float64
TEAM_ID_AWAY                        int64
POINTS_AWAY                       float64
FIELD_GOALS_PERCENTAGE_AWAY       float64
FREE_THROWS_PERCENTAGE_AWAY       float64
THREE_POINTERS_PERCENTAGE_AWAY    float64
ASSISTS_AWAY                      float64
TOTAL_REBOUNDS_AWAY               float64
HOME_TEAM_WINS                      int64
dtype: object
In [114]:
print(game_df.dtypes)
SEASON_ID                           int64
TEAM_ID_HOME                        int64
TEAM_ABBREVIATION_HOME             object
TEAM_NAME_HOME                     object
GAME_ID                             int64
GAME_DATE                          object
MATCHUP_HOME                       object
WIN_LOSS_HOME                      object
MINUTES                             int64
FIELD_GOALS_MADE_HOME             float64
FIELD_GOALS_ATTEMPTED_HOME        float64
FIELD_GOALS_PERCENTAGE_HOME       float64
THREE_POINTERS_HOME               float64
THREE_POINTERS_ATTEMPTED_HOME     float64
THREE_POINTERS_PERCENTAGE_HOME    float64
FREE_THROWS_MADE_HOME             float64
FREE_THROWS_ATTEMPTED_HOME        float64
FREE_THROWS_PERCENTAGE_HOME       float64
OFFENSIVE_REBOUND_HOME            float64
DEFENSIVE_REBOUND_HOME            float64
TOTAL_REBOUNDS_HOME               float64
ASSISTS_HOME                      float64
STEALS_HOME                       float64
BLOCKS_HOME                       float64
TURNOVERS_HOME                    float64
PERSONAL_FOULS_HOME               float64
TOTAL_POINTS_HOME                   int64
PLUS_MINUS_HOME                     int64
TEAM_ID_AWAY                        int64
TEAM_ABBREVIATION_AWAY             object
TEAM_NAME_AWAY                     object
MATCHUP_AWAY                       object
WIN_LOSS_AWAY                      object
MINUTES.1                         float64
FIELD_GOALS_MADE_AWAY             float64
FIELD_GOALS_ATTEMPTED_AWAY        float64
FIELD_GOALS_PERCENTAGE_AWAY       float64
THREE_POINTERS_WAY                float64
THREE_POINTERS_ATTEMPTED_AWAY     float64
THREE_POINTERS_PERCENTAGE_AWAY    float64
FREE_THROWS_MADE_AWAY             float64
FREE_THROWS_ATTEMPTED_AWAY        float64
FREE_THROWS_PERCENTAGE_AWAY       float64
OFFENSIVE_REBOUND_AWAY            float64
DEFENSIVE_REBOUND_AWAY            float64
TOTAL_REBOUNDS_AWAY               float64
ASSISTS_AWAY                      float64
STEALS_AWAY                       float64
BLOCKS_AWAY                       float64
TURNOVERS_AWAY                    float64
PERSONAL_FOULS_AWAY                 int64
TOTAL_POINTS_AWAY                   int64
PLUS_MINUS_AWAY                     int64
SEASON_TYPE                        object
dtype: object
In [115]:
# Do a left merge on the dataframes on GAME_ID
# This is done to assert all dataframes have a corresponding win and loss column
game_df = pd.merge(game_df, games_df[['GAME_ID', 'HOME_TEAM_WINS']], on=['GAME_ID'], how='left')

# Display the merged dataframe
print(game_df.notnull().sum())

#For some reason, game has only a few matches for home team wins
SEASON_ID                         65727
TEAM_ID_HOME                      65727
TEAM_ABBREVIATION_HOME            65727
TEAM_NAME_HOME                    65727
GAME_ID                           65727
GAME_DATE                         65727
MATCHUP_HOME                      65727
WIN_LOSS_HOME                     65725
MINUTES                           65727
FIELD_GOALS_MADE_HOME             65714
FIELD_GOALS_ATTEMPTED_HOME        50280
FIELD_GOALS_PERCENTAGE_HOME       50237
THREE_POINTERS_HOME               52509
THREE_POINTERS_ATTEMPTED_HOME     47044
THREE_POINTERS_PERCENTAGE_HOME    46653
FREE_THROWS_MADE_HOME             65711
FREE_THROWS_ATTEMPTED_HOME        62723
FREE_THROWS_PERCENTAGE_HOME       62718
OFFENSIVE_REBOUND_HOME            46791
DEFENSIVE_REBOUND_HOME            46728
TOTAL_REBOUNDS_HOME               49998
ASSISTS_HOME                      49922
STEALS_HOME                       46878
BLOCKS_HOME                       47101
TURNOVERS_HOME                    47043
PERSONAL_FOULS_HOME               62871
TOTAL_POINTS_HOME                 65727
PLUS_MINUS_HOME                   65727
TEAM_ID_AWAY                      65727
TEAM_ABBREVIATION_AWAY            65727
TEAM_NAME_AWAY                    65727
MATCHUP_AWAY                      65727
WIN_LOSS_AWAY                     65725
MINUTES.1                         65714
FIELD_GOALS_MADE_AWAY             50280
FIELD_GOALS_ATTEMPTED_AWAY        50238
FIELD_GOALS_PERCENTAGE_AWAY       52509
THREE_POINTERS_WAY                47044
THREE_POINTERS_ATTEMPTED_AWAY     46765
THREE_POINTERS_PERCENTAGE_AWAY    65714
FREE_THROWS_MADE_AWAY             62723
FREE_THROWS_ATTEMPTED_AWAY        62721
FREE_THROWS_PERCENTAGE_AWAY       46791
OFFENSIVE_REBOUND_AWAY            46729
DEFENSIVE_REBOUND_AWAY            50002
TOTAL_REBOUNDS_AWAY               49926
ASSISTS_AWAY                      46878
STEALS_AWAY                       47102
BLOCKS_AWAY                       47042
TURNOVERS_AWAY                    62876
PERSONAL_FOULS_AWAY               65727
TOTAL_POINTS_AWAY                 65727
PLUS_MINUS_AWAY                   65727
SEASON_TYPE                       65727
HOME_TEAM_WINS                    24986
dtype: int64
In [116]:
#This dataframe has been dropped
# Do a left merge on the dataframes on GAME_ID
# This is done to assert all dataframes have a corresponding win and loss column
games_details_df = pd.merge(games_details_df, games_df[['GAME_ID', 'HOME_TEAM_WINS']], on=['GAME_ID'], how='left')

# Display the merged dataframe
print(games_details_df.notnull().sum())
GAME_ID                      669560
TEAM_ID                      669560
TEAM_ABBREVIATION            669560
TEAM_CITY                    669560
PLAYER_ID                    669560
PLAYER_NAME                  669560
NICKNAME                      53037
START_POSITION               256104
COMMENT                      109881
MINUTES                      559678
FIELD_GOALS_MADE             559678
FIELD_GOALS_ATTEMPTED        559678
FIELD_GOAL_PERCENTAGE        559678
THREE_POINTERS_MADE          559678
THREE_POINTERS_ATTEMPTED     559678
THREE_POINTERS_PERCENTAGE    559678
FREE_THROWS_MADE             559678
FREE_THROWS_ATTEMPTED        559678
FREE_THROWS_PERCENTAGE       559678
OFFENSIVE_REBOUND            559678
DEFENSIVE_REBOUND            559678
TOTAL_REBOUNDS               559678
ASSITS                       559678
STEALS                       559678
BLOCKS                       559678
TURNOVERS                    559678
PERSONAL_FOULS               559678
TOTAL_POINTS                 559678
PLUS_MINUS                   536017
HOME_TEAM_WINS               669560
dtype: int64
In [117]:
# Doing a Heatmap to determine which variables are mostly related with the number of
# Home wins

# Calculate the correlation matrix
correlation_matrix = games_df.corr(numeric_only=True)


# Plotting the heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


# Sorting correlations with respect to HOME_TEAM_WINS
correlation_with_wins = correlation_matrix['HOME_TEAM_WINS'].sort_values(ascending=False)


print("Correlation with HOME_TEAM_WINS:")
print(correlation_with_wins)
Correlation with HOME_TEAM_WINS:
HOME_TEAM_WINS                    1.000000
FIELD_GOALS_PERCENTAGE_HOME       0.432700
POINTS_HOME                       0.395810
ASSISTS_HOME                      0.301750
THREE_POINTERS_PERCENTAGE_HOME    0.301669
TOTAL_REBOUNDS_HOME               0.245884
FREE_THROWS_PERCENTAGE_HOME       0.093394
VISITOR_TEAM_ID                   0.032002
TEAM_ID_AWAY                      0.032002
GAME_ID                           0.030804
TEAM_ID_HOME                     -0.026123
HOME_TEAM_ID                     -0.026123
SEASON                           -0.027849
FREE_THROWS_PERCENTAGE_AWAY      -0.108255
TOTAL_REBOUNDS_AWAY              -0.254746
THREE_POINTERS_PERCENTAGE_AWAY   -0.291838
ASSISTS_AWAY                     -0.295013
POINTS_AWAY                      -0.414452
FIELD_GOALS_PERCENTAGE_AWAY      -0.436201
Name: HOME_TEAM_WINS, dtype: float64
In [118]:
#drop columns with a low correlation coefficient that is neither very positive nor very negative
games_df.drop(columns=['FREE_THROWS_PERCENTAGE_HOME', 'VISITOR_TEAM_ID',
                               'TEAM_ID_AWAY', 'FREE_THROWS_PERCENTAGE_AWAY',
                               ], inplace=True)
print(games_details_df.dtypes)
GAME_ID                        int64
TEAM_ID                        int64
TEAM_ABBREVIATION             object
TEAM_CITY                     object
PLAYER_ID                      int64
PLAYER_NAME                   object
NICKNAME                      object
START_POSITION                object
COMMENT                       object
MINUTES                       object
FIELD_GOALS_MADE             float64
FIELD_GOALS_ATTEMPTED        float64
FIELD_GOAL_PERCENTAGE        float64
THREE_POINTERS_MADE          float64
THREE_POINTERS_ATTEMPTED     float64
THREE_POINTERS_PERCENTAGE    float64
FREE_THROWS_MADE             float64
FREE_THROWS_ATTEMPTED        float64
FREE_THROWS_PERCENTAGE       float64
OFFENSIVE_REBOUND            float64
DEFENSIVE_REBOUND            float64
TOTAL_REBOUNDS               float64
ASSITS                       float64
STEALS                       float64
BLOCKS                       float64
TURNOVERS                    float64
PERSONAL_FOULS               float64
TOTAL_POINTS                 float64
PLUS_MINUS                   float64
HOME_TEAM_WINS                 int64
dtype: object
In [119]:
print(games_df.notnull().sum())
GAME_DATE_EST                     26651
GAME_ID                           26651
GAME_STATUS_TEXT                  26651
HOME_TEAM_ID                      26651
SEASON                            26651
TEAM_ID_HOME                      26651
POINTS_HOME                       26552
FIELD_GOALS_PERCENTAGE_HOME       26552
THREE_POINTERS_PERCENTAGE_HOME    26552
ASSISTS_HOME                      26552
TOTAL_REBOUNDS_HOME               26552
POINTS_AWAY                       26552
FIELD_GOALS_PERCENTAGE_AWAY       26552
THREE_POINTERS_PERCENTAGE_AWAY    26552
ASSISTS_AWAY                      26552
TOTAL_REBOUNDS_AWAY               26552
HOME_TEAM_WINS                    26651
dtype: int64
In [120]:
#Drop rows with no data

games_df.dropna(subset=['HOME_TEAM_WINS'], inplace=True)
print(games_df.notnull().sum())
GAME_DATE_EST                     26651
GAME_ID                           26651
GAME_STATUS_TEXT                  26651
HOME_TEAM_ID                      26651
SEASON                            26651
TEAM_ID_HOME                      26651
POINTS_HOME                       26552
FIELD_GOALS_PERCENTAGE_HOME       26552
THREE_POINTERS_PERCENTAGE_HOME    26552
ASSISTS_HOME                      26552
TOTAL_REBOUNDS_HOME               26552
POINTS_AWAY                       26552
FIELD_GOALS_PERCENTAGE_AWAY       26552
THREE_POINTERS_PERCENTAGE_AWAY    26552
ASSISTS_AWAY                      26552
TOTAL_REBOUNDS_AWAY               26552
HOME_TEAM_WINS                    26651
dtype: int64
In [121]:
#Dealing with game_df
# Calculate the correlation between each independent variable and the dependent variable
correlation = game_df.corr(numeric_only=True)['HOME_TEAM_WINS'].abs().sort_values(ascending=False)

print(correlation)
HOME_TEAM_WINS                    1.000000
TOTAL_POINTS_AWAY                 0.800786
PLUS_MINUS_HOME                   0.800786
FIELD_GOALS_ATTEMPTED_AWAY        0.435969
FIELD_GOALS_PERCENTAGE_HOME       0.432477
PERSONAL_FOULS_AWAY               0.412932
TOTAL_POINTS_HOME                 0.395754
MINUTES.1                         0.346459
OFFENSIVE_REBOUND_AWAY            0.338537
FIELD_GOALS_MADE_HOME             0.334366
DEFENSIVE_REBOUND_HOME            0.311751
THREE_POINTERS_PERCENTAGE_HOME    0.303110
ASSISTS_HOME                      0.303104
THREE_POINTERS_ATTEMPTED_AWAY     0.293744
TOTAL_REBOUNDS_AWAY               0.293129
DEFENSIVE_REBOUND_AWAY            0.255121
TOTAL_REBOUNDS_HOME               0.249084
FIELD_GOALS_PERCENTAGE_AWAY       0.209125
THREE_POINTERS_HOME               0.159393
BLOCKS_HOME                       0.154001
THREE_POINTERS_PERCENTAGE_AWAY    0.152171
STEALS_HOME                       0.149421
FREE_THROWS_MADE_HOME             0.146339
STEALS_AWAY                       0.133931
ASSISTS_AWAY                      0.122795
FREE_THROWS_ATTEMPTED_HOME        0.117824
FREE_THROWS_MADE_AWAY             0.117455
BLOCKS_AWAY                       0.116362
FREE_THROWS_ATTEMPTED_AWAY        0.112495
PERSONAL_FOULS_HOME               0.110610
TURNOVERS_AWAY                    0.099950
FREE_THROWS_PERCENTAGE_HOME       0.094541
TURNOVERS_HOME                    0.086109
FIELD_GOALS_ATTEMPTED_HOME        0.053229
FREE_THROWS_PERCENTAGE_AWAY       0.051830
THREE_POINTERS_WAY                0.041980
FIELD_GOALS_MADE_AWAY             0.032502
TEAM_ID_AWAY                      0.029386
MINUTES                           0.028381
PLUS_MINUS_AWAY                   0.027812
TEAM_ID_HOME                      0.027183
THREE_POINTERS_ATTEMPTED_HOME     0.026635
SEASON_ID                         0.022134
OFFENSIVE_REBOUND_HOME            0.021336
GAME_ID                           0.018529
Name: HOME_TEAM_WINS, dtype: float64
In [122]:
game_df.columns.tolist()
Out[122]:
['SEASON_ID',
 'TEAM_ID_HOME',
 'TEAM_ABBREVIATION_HOME',
 'TEAM_NAME_HOME',
 'GAME_ID',
 'GAME_DATE',
 'MATCHUP_HOME',
 'WIN_LOSS_HOME',
 'MINUTES',
 'FIELD_GOALS_MADE_HOME',
 'FIELD_GOALS_ATTEMPTED_HOME',
 'FIELD_GOALS_PERCENTAGE_HOME',
 'THREE_POINTERS_HOME',
 'THREE_POINTERS_ATTEMPTED_HOME',
 'THREE_POINTERS_PERCENTAGE_HOME',
 'FREE_THROWS_MADE_HOME',
 'FREE_THROWS_ATTEMPTED_HOME',
 'FREE_THROWS_PERCENTAGE_HOME',
 'OFFENSIVE_REBOUND_HOME',
 'DEFENSIVE_REBOUND_HOME',
 'TOTAL_REBOUNDS_HOME',
 'ASSISTS_HOME',
 'STEALS_HOME',
 'BLOCKS_HOME',
 'TURNOVERS_HOME',
 'PERSONAL_FOULS_HOME',
 'TOTAL_POINTS_HOME',
 'PLUS_MINUS_HOME',
 'TEAM_ID_AWAY',
 'TEAM_ABBREVIATION_AWAY',
 'TEAM_NAME_AWAY',
 'MATCHUP_AWAY',
 'WIN_LOSS_AWAY',
 'MINUTES.1',
 'FIELD_GOALS_MADE_AWAY',
 'FIELD_GOALS_ATTEMPTED_AWAY',
 'FIELD_GOALS_PERCENTAGE_AWAY',
 'THREE_POINTERS_WAY',
 'THREE_POINTERS_ATTEMPTED_AWAY',
 'THREE_POINTERS_PERCENTAGE_AWAY',
 'FREE_THROWS_MADE_AWAY',
 'FREE_THROWS_ATTEMPTED_AWAY',
 'FREE_THROWS_PERCENTAGE_AWAY',
 'OFFENSIVE_REBOUND_AWAY',
 'DEFENSIVE_REBOUND_AWAY',
 'TOTAL_REBOUNDS_AWAY',
 'ASSISTS_AWAY',
 'STEALS_AWAY',
 'BLOCKS_AWAY',
 'TURNOVERS_AWAY',
 'PERSONAL_FOULS_AWAY',
 'TOTAL_POINTS_AWAY',
 'PLUS_MINUS_AWAY',
 'SEASON_TYPE',
 'HOME_TEAM_WINS']

Finding the correlation when home win is positive (W). We are doing this given that game_df has more data (rows) than we have for the WIN_LOSS_HOME. Thus assuming based on just that correlation might be father from accurate

In [123]:
# Replace 'W' with 1 and 'F' with 0 in the 'WIN_LOSS_HOME' column
game_df['WIN_LOSS_HOME'] = game_df['WIN_LOSS_HOME'].replace({'W': 1, 'L': 0})

# Drop rows where 'WIN_LOSS_HOME' is NaN
game_df = game_df.dropna(subset=['WIN_LOSS_HOME'])

# Convert the column to integer type
game_df['WIN_LOSS_HOME'] = game_df['WIN_LOSS_HOME'].astype(int)

# Calculate the correlation between 'WIN_LOSS_HOME' and each other column
#correlation_with_win = game_df.corr()['WIN_LOSS_HOME'].abs().sort_values(ascending=False)

# Display the correlation coefficients
game_df.head(10)
<ipython-input-123-c0b1ed124902>:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  game_df['WIN_LOSS_HOME'] = game_df['WIN_LOSS_HOME'].astype(int)
Out[123]:
SEASON_ID TEAM_ID_HOME TEAM_ABBREVIATION_HOME TEAM_NAME_HOME GAME_ID GAME_DATE MATCHUP_HOME WIN_LOSS_HOME MINUTES FIELD_GOALS_MADE_HOME ... TOTAL_REBOUNDS_AWAY ASSISTS_AWAY STEALS_AWAY BLOCKS_AWAY TURNOVERS_AWAY PERSONAL_FOULS_AWAY TOTAL_POINTS_AWAY PLUS_MINUS_AWAY SEASON_TYPE HOME_TEAM_WINS
0 21946 1610610035 HUS Toronto Huskies 24600001 11/1/1946 0:00 HUS vs. NYK 0 0 25.0 ... NaN NaN NaN NaN NaN 68 2 0 Regular Season NaN
1 21946 1610610034 BOM St. Louis Bombers 24600003 11/2/1946 0:00 BOM vs. PIT 1 0 20.0 ... NaN NaN NaN NaN 25.0 51 -5 0 Regular Season NaN
2 21946 1610610032 PRO Providence Steamrollers 24600002 11/2/1946 0:00 PRO vs. BOS 1 0 21.0 ... NaN NaN NaN NaN NaN 53 -6 0 Regular Season NaN
3 21946 1610610025 CHS Chicago Stags 24600004 11/2/1946 0:00 CHS vs. NYK 1 0 21.0 ... NaN NaN NaN NaN 22.0 47 -16 0 Regular Season NaN
4 21946 1610610028 DEF Detroit Falcons 24600005 11/2/1946 0:00 DEF vs. WAS 0 0 10.0 ... NaN NaN NaN NaN NaN 50 17 0 Regular Season NaN
5 21946 1610610026 CLR Cleveland Rebels 24600006 11/3/1946 0:00 CLR vs. HUS 1 0 24.0 ... NaN NaN NaN NaN NaN 60 -11 0 Regular Season NaN
6 21946 1610610031 PIT Pittsburgh Ironmen 24600007 11/4/1946 0:00 PIT vs. WAS 0 0 19.0 ... NaN NaN NaN NaN NaN 71 15 0 Regular Season NaN
7 21946 1610612738 BOS Boston Celtics 24600008 11/5/1946 0:00 BOS vs. CHS 0 0 23.0 ... NaN NaN NaN NaN NaN 57 2 0 Regular Season NaN
8 21946 1610610028 DEF Detroit Falcons 24600009 11/5/1946 0:00 DEF vs. BOM 0 0 18.0 ... NaN NaN NaN NaN NaN 53 4 0 Regular Season NaN
9 21946 1610610032 PRO Providence Steamrollers 24600011 11/7/1946 0:00 PRO vs. CHS 1 0 31.0 ... NaN NaN NaN NaN 14.0 65 -8 0 Regular Season NaN

10 rows × 55 columns

In [124]:
# Calculate the correlation between 'WIN_LOSS_HOME' and each other column
correlation_with_win = game_df.corr(numeric_only=True)['WIN_LOSS_HOME'].abs().sort_values(ascending=False)

# Display the correlation coefficients
print(correlation_with_win)
WIN_LOSS_HOME                     1.000000
HOME_TEAM_WINS                    0.999917
TOTAL_POINTS_AWAY                 0.791378
PLUS_MINUS_HOME                   0.791378
FIELD_GOALS_PERCENTAGE_HOME       0.427858
FIELD_GOALS_ATTEMPTED_AWAY        0.397353
TOTAL_POINTS_HOME                 0.360849
PERSONAL_FOULS_AWAY               0.349415
OFFENSIVE_REBOUND_AWAY            0.342376
ASSISTS_HOME                      0.309796
DEFENSIVE_REBOUND_HOME            0.305064
FIELD_GOALS_MADE_HOME             0.287747
MINUTES.1                         0.267723
TOTAL_REBOUNDS_AWAY               0.258721
DEFENSIVE_REBOUND_AWAY            0.231091
THREE_POINTERS_ATTEMPTED_AWAY     0.230134
TOTAL_REBOUNDS_HOME               0.230132
THREE_POINTERS_PERCENTAGE_HOME    0.215919
FREE_THROWS_MADE_HOME             0.186840
STEALS_HOME                       0.164890
BLOCKS_HOME                       0.156968
FREE_THROWS_ATTEMPTED_HOME        0.155304
FIELD_GOALS_PERCENTAGE_AWAY       0.149945
THREE_POINTERS_PERCENTAGE_AWAY    0.135328
STEALS_AWAY                       0.135079
TURNOVERS_AWAY                    0.133878
BLOCKS_AWAY                       0.133513
ASSISTS_AWAY                      0.113065
FREE_THROWS_ATTEMPTED_AWAY        0.109990
FREE_THROWS_MADE_AWAY             0.109458
FREE_THROWS_PERCENTAGE_HOME       0.101418
PERSONAL_FOULS_HOME               0.093957
TURNOVERS_HOME                    0.073045
FREE_THROWS_PERCENTAGE_AWAY       0.071782
THREE_POINTERS_ATTEMPTED_HOME     0.066632
FIELD_GOALS_MADE_AWAY             0.064048
THREE_POINTERS_HOME               0.055654
MINUTES                           0.054040
PLUS_MINUS_AWAY                   0.046754
FIELD_GOALS_ATTEMPTED_HOME        0.041630
THREE_POINTERS_WAY                0.036150
GAME_ID                           0.033467
TEAM_ID_AWAY                      0.021365
OFFENSIVE_REBOUND_HOME            0.019098
SEASON_ID                         0.014567
TEAM_ID_HOME                      0.014155
Name: WIN_LOSS_HOME, dtype: float64
In [125]:
#Drop columns that might not help us as much

game_df.drop(columns=['OFFENSIVE_REBOUND_HOME', 'TEAM_ID_HOME',
                               'TEAM_ID_AWAY', 'THREE_POINTERS_WAY',
                      'FIELD_GOALS_ATTEMPTED_HOME', 'PLUS_MINUS_AWAY',
                      'MINUTES', 'THREE_POINTERS_HOME', 'FIELD_GOALS_MADE_AWAY',
                      'THREE_POINTERS_ATTEMPTED_HOME', 'FREE_THROWS_PERCENTAGE_AWAY',
                      'TURNOVERS_HOME', 'PERSONAL_FOULS_HOME', 'FREE_THROWS_PERCENTAGE_HOME',
                      'FREE_THROWS_MADE_HOME', 'STEALS_HOME', 'BLOCKS_HOME',
                      'FREE_THROWS_ATTEMPTED_HOME', 'FIELD_GOALS_PERCENTAGE_AWAY',
                      'THREE_POINTERS_PERCENTAGE_AWAY', 'STEALS_AWAY', 'TURNOVERS_AWAY',
                      'BLOCKS_AWAY', 'ASSISTS_AWAY', 'FREE_THROWS_ATTEMPTED_AWAY',
                      'FREE_THROWS_MADE_AWAY'], inplace=True)
print(game_df.notnull().sum())
SEASON_ID                         65725
TEAM_ABBREVIATION_HOME            65725
TEAM_NAME_HOME                    65725
GAME_ID                           65725
GAME_DATE                         65725
MATCHUP_HOME                      65725
WIN_LOSS_HOME                     65725
FIELD_GOALS_MADE_HOME             65712
FIELD_GOALS_PERCENTAGE_HOME       50235
THREE_POINTERS_PERCENTAGE_HOME    46651
DEFENSIVE_REBOUND_HOME            46726
TOTAL_REBOUNDS_HOME               49996
ASSISTS_HOME                      49920
TOTAL_POINTS_HOME                 65725
PLUS_MINUS_HOME                   65725
TEAM_ABBREVIATION_AWAY            65725
TEAM_NAME_AWAY                    65725
MATCHUP_AWAY                      65725
WIN_LOSS_AWAY                     65725
MINUTES.1                         65712
FIELD_GOALS_ATTEMPTED_AWAY        50236
THREE_POINTERS_ATTEMPTED_AWAY     46763
OFFENSIVE_REBOUND_AWAY            46727
DEFENSIVE_REBOUND_AWAY            50000
TOTAL_REBOUNDS_AWAY               49924
PERSONAL_FOULS_AWAY               65725
TOTAL_POINTS_AWAY                 65725
SEASON_TYPE                       65725
HOME_TEAM_WINS                    24984
dtype: int64
In [126]:
#Impute values based on teams with the same name, and in the same season

#Impute values based on teams with the same name
imputer = SimpleImputer(strategy='mean')

# Define the columns to impute
columns_to_impute = ['FIELD_GOALS_PERCENTAGE_HOME', 'THREE_POINTERS_PERCENTAGE_HOME',
                     'DEFENSIVE_REBOUND_HOME', 'TOTAL_REBOUNDS_HOME', 'ASSISTS_HOME',
                     'MINUTES.1', 'FIELD_GOALS_ATTEMPTED_AWAY', 'THREE_POINTERS_ATTEMPTED_AWAY',
                     'OFFENSIVE_REBOUND_AWAY', 'DEFENSIVE_REBOUND_AWAY', 'TOTAL_REBOUNDS_AWAY']

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Iterate over each column to impute
for column in columns_to_impute:
    # Group by 'TEAM' and apply the imputer to each group
    game_df[column] = game_df.groupby('TEAM_NAME_HOME')[column].transform(
        lambda x: imputer.fit_transform(x.values.reshape(-1, 1)).ravel() if x.notnull().any() else x
    )

# Print the count of non-null values after imputation
print(game_df.notnull().sum())
SEASON_ID                         65725
TEAM_ABBREVIATION_HOME            65725
TEAM_NAME_HOME                    65725
GAME_ID                           65725
GAME_DATE                         65725
MATCHUP_HOME                      65725
WIN_LOSS_HOME                     65725
FIELD_GOALS_MADE_HOME             65712
FIELD_GOALS_PERCENTAGE_HOME       64536
THREE_POINTERS_PERCENTAGE_HOME    60100
DEFENSIVE_REBOUND_HOME            59761
TOTAL_REBOUNDS_HOME               63651
ASSISTS_HOME                      63618
TOTAL_POINTS_HOME                 65725
PLUS_MINUS_HOME                   65725
TEAM_ABBREVIATION_AWAY            65725
TEAM_NAME_AWAY                    65725
MATCHUP_AWAY                      65725
WIN_LOSS_AWAY                     65725
MINUTES.1                         65725
FIELD_GOALS_ATTEMPTED_AWAY        64536
THREE_POINTERS_ATTEMPTED_AWAY     60100
OFFENSIVE_REBOUND_AWAY            59761
DEFENSIVE_REBOUND_AWAY            63651
TOTAL_REBOUNDS_AWAY               63618
PERSONAL_FOULS_AWAY               65725
TOTAL_POINTS_AWAY                 65725
SEASON_TYPE                       65725
HOME_TEAM_WINS                    24984
dtype: int64
In [127]:
#dropping all rows where some values are null and also dropping teh home_team_win
# colukn as it migt not be too beneficial

# Define the list of columns to check for null values
columns_to_check = ['OFFENSIVE_REBOUND_AWAY', 'DEFENSIVE_REBOUND_HOME',
                    'FIELD_GOALS_PERCENTAGE_HOME']

# Drop rows where any of the specified columns have null values
game_df.dropna(subset=columns_to_check, inplace=True)

print(game_df.notnull().sum())
SEASON_ID                         59761
TEAM_ABBREVIATION_HOME            59761
TEAM_NAME_HOME                    59761
GAME_ID                           59761
GAME_DATE                         59761
MATCHUP_HOME                      59761
WIN_LOSS_HOME                     59761
FIELD_GOALS_MADE_HOME             59760
FIELD_GOALS_PERCENTAGE_HOME       59761
THREE_POINTERS_PERCENTAGE_HOME    59761
DEFENSIVE_REBOUND_HOME            59761
TOTAL_REBOUNDS_HOME               59761
ASSISTS_HOME                      59761
TOTAL_POINTS_HOME                 59761
PLUS_MINUS_HOME                   59761
TEAM_ABBREVIATION_AWAY            59761
TEAM_NAME_AWAY                    59761
MATCHUP_AWAY                      59761
WIN_LOSS_AWAY                     59761
MINUTES.1                         59761
FIELD_GOALS_ATTEMPTED_AWAY        59761
THREE_POINTERS_ATTEMPTED_AWAY     59761
OFFENSIVE_REBOUND_AWAY            59761
DEFENSIVE_REBOUND_AWAY            59761
TOTAL_REBOUNDS_AWAY               59761
PERSONAL_FOULS_AWAY               59761
TOTAL_POINTS_AWAY                 59761
SEASON_TYPE                       59761
HOME_TEAM_WINS                    24984
dtype: int64

We will now move forward with the Data exploration and Summary Statistics

IV - Exploratory Data Analysis¶

Creating a Composite Score and Utilizing Linear Regression for NBA Championship Prediction

In the pursuit of accurately predicting the next NBA championship winner, a crucial step involved the creation of a composite score derived from columns that exhibited the highest correlation with a team's success. This approach aimed to distill the multifaceted dynamics of basketball into a singular metric, facilitating a more comprehensive assessment of team performance. By amalgamating key indicators such as points per game, field goal percentage, defensive rebounds, and turnovers, among others, into a unified composite score, the model could encapsulate the essence of team proficiency across various facets of the game.

The rationale behind employing linear regression with this newly constructed composite score lies in its ability to delineate and quantify the relationship between predictor variables and the target variable – in this case, the likelihood of a team clinching the NBA championship. Linear regression offered a structured framework to analyze how changes in the composite score corresponded to variations in championship success probabilities. Leveraging this statistical technique allowed for the identification of patterns and trends within the data, enabling more informed predictions regarding potential championship outcomes. Moreover, the simplicity and interpretability of linear regression facilitated clear insights into the relative significance of different components comprising the composite score, aiding in the refinement and optimization of the prediction model.

In essence, the integration of a composite score and the application of linear regression represent a strategic approach to enhance the predictive efficacy of the NBA championship forecasting model. By amalgamating relevant metrics and employing a robust statistical methodology, the endeavor seeks to unravel the intricacies of basketball dynamics and furnish valuable insights into the determinants of championship success.

Note: For all Linear regression, the Null Hypothesis (H0): The coefficient of 'Combined_Metric' is equal to zero. There is no linear relationship between 'Combined_Metric' and the number of wins ('W').

In [128]:
# Select only numeric columns, excluding 'W' and 'L' for correlation calculation
numeric_cols = merged_team_df.select_dtypes(include=np.number).drop(['W', 'L'], axis=1)

# Calculate correlation with 'W'
correlation_with_w = numeric_cols.apply(lambda x: x.corr(merged_team_df['W']))

# Standardize numeric variables
normalized_variables = (numeric_cols - numeric_cols.mean()) / numeric_cols.std()

# Calculate the composite score, taking into account the direction of correlation
merged_team_df['Composite Score'] = normalized_variables.multiply(correlation_with_w, axis=1).sum(axis=1)
In [129]:
merged_team_df['Composite Score']
Out[129]:
0       -7.188061
1        9.512361
2       -9.151929
3       -6.050632
4      -19.029235
          ...    
1600   -21.047678
1601   -23.873765
1602   -29.726828
1603   -23.842757
1604   -13.805353
Name: Composite Score, Length: 1605, dtype: float64

We will now be using linear regrsesion to undestand te realtionship between the "Combined Metric" and the number of wins in the 'W'column

In [130]:
X = merged_team_df['Composite Score']  # Independent variable
y = merged_team_df['W']  # Dependent variable
In [131]:
# Add a constant term to the independent variable. This is necessary for
# estimating the intercept (the value of 'y' when 'x' is zero) in the linear regression model.

X = sm.add_constant(X)
In [132]:
#fit the linear regression model using the OLS (ordinary least squares) method from statsmodels.

model = sm.OLS(y, X).fit()
In [133]:
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.850
Model:                            OLS   Adj. R-squared:                  0.850
Method:                 Least Squares   F-statistic:                     9071.
Date:                Wed, 08 May 2024   Prob (F-statistic):               0.00
Time:                        03:10:46   Log-Likelihood:                -4819.4
No. Observations:                1605   AIC:                             9643.
Df Residuals:                    1603   BIC:                             9654.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              39.9745      0.122    328.409      0.000      39.736      40.213
Composite Score     1.3485      0.014     95.240      0.000       1.321       1.376
==============================================================================
Omnibus:                       93.333   Durbin-Watson:                   1.140
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              168.512
Skew:                           0.426   Prob(JB):                     2.56e-37
Kurtosis:                       4.339   Cond. No.                         8.60
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation and Conclusion:¶

The new linear regression analysis with the updated combined metric ('Composite Score') yielded the following results:

  • R-squared (R²): The R-squared value is 0.850, indicating that approximately 85.0% of the variance in the number of wins ('W') can be explained by the 'Composite Score'. This suggests a strong relationship between the combined metric and team success.

  • Coefficient (coef) and its p-value: The coefficient of the 'Composite Score' is 1.3485, indicating that for each one-unit increase in the 'Composite Score', there is an expected increase of approximately 1.3485 units in the number of wins ('W'). The p-value associated with this coefficient is very small (p < 0.001), indicating that the coefficient is statistically significant.

  • Intercept (const) and its p-value: The intercept term is 39.9745, representing the expected number of wins when the 'Composite Score' is zero. The intercept is also statistically significant with a p-value of 0.000.

  • F-statistic and its p-value: The F-statistic tests the overall significance of the regression model. The probability associated with the F-statistic is 0.00, indicating that the regression model as a whole is statistically significant.

Conclusion:¶

The results of the linear regression analysis indicate that the 'Composite Score' is a highly significant predictor of the number of wins ('W'). The strong R-squared value suggests that the 'Composite Score' explains a substantial portion of the variability in team success. Each unit increase in the 'Composite Score' is associated with a significant increase in the number of wins.

Therefore, based on this analysis, it can be concluded that the 'Composite Score' derived from the combination of metrics provides a robust and effective measure of team performance, with a strong predictive relationship with the number of wins. This underscores the importance of considering multiple performance indicators and refining the combined metric to better capture the factors driving team success.

In [134]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_team_df, x='Composite Score', y='W', color='blue', alpha=0.5)
sns.regplot(data=merged_team_df, x='Composite Score', y='W', scatter=False, color='red', line_kws={'linewidth': 2})
plt.title('Relationship between Composite Score and Number of Wins')
plt.xlabel('Composite Score')
plt.ylabel('Number of Wins')
plt.grid(True)
plt.show()

We will now be doing descriptive statistics on the 'Total Points Home' column from the game_df

In [135]:
column_name = 'TOTAL_POINTS_HOME'

# Calculate descriptive statistics
mean_value = game_df[column_name].mean()
median_value = game_df[column_name].median()
std_dev = game_df[column_name].std()
min_value = game_df[column_name].min()
max_value = game_df[column_name].max()

# Print descriptive statistics
print("Descriptive Statistics for", column_name)
print("Mean:", mean_value)
print("Median:", median_value)
print("Standard Deviation:", std_dev)
print("Minimum:", min_value)
print("Maximum:", max_value)
Descriptive Statistics for TOTAL_POINTS_HOME
Mean: 105.23513662756648
Median: 105.0
Standard Deviation: 14.060249885493763
Minimum: 45
Maximum: 192

Based on the descriptive statistics for the variable 'TOTAL_POINTS_HOME' in the game_df dataframe:

  1. Mean (Average): The mean total points scored by the home teams is approximately 105.23 points per game. This value represents the central tendency of the data and gives an indication of the typical scoring performance of home teams.

  2. Median: The median total points scored by the home teams is 105.0 points per game. Since the median is close to the mean, it suggests that the distribution of total points scored by home teams is approximately symmetrical.

  3. Standard Deviation: The standard deviation of approximately 14.06 indicates the average amount of variability or dispersion in the total points scored by home teams around the mean. A higher standard deviation suggests greater variability in scoring performance among home teams.

  4. Minimum and Maximum: The minimum total points scored by a home team in a game is 45, while the maximum total points scored is 192. This range of 147 points reflects the variability in scoring performances observed across different games.

Conclusion: Based on these descriptive statistics, we can conclude that home teams in the dataset tend to score an average of around 105 points per game, with a moderate level of variability around this average. The distribution of total points scored appears to be relatively symmetrical, as indicated by the similarity between the mean and median values. However, the variability in scoring performances among home teams, as evidenced by the standard deviation and range, suggests that some games may have significantly higher or lower scoring outcomes compared to the average.

In [136]:
total_points_home = game_df['TOTAL_POINTS_HOME']

# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(total_points_home, bins=20, color='skyblue', edgecolor='black', alpha=0.7)

# Add mean, median, and standard deviation error bars
mean_value = total_points_home.mean()
median_value = total_points_home.median()
std_dev = total_points_home.std()

plt.axvline(mean_value, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_value:.2f}')
plt.axvline(median_value, color='green', linestyle='dashed', linewidth=1, label=f'Median: {median_value:.2f}')
plt.errorbar(mean_value, 500, xerr=std_dev, fmt='o', color='red', label=f'Standard Deviation: {std_dev:.2f}')

# Add labels and title
plt.xlabel('Total Points Scored by Home Teams')
plt.ylabel('Frequency')
plt.title('Distribution of Total Points Scored by Home Teams')
plt.legend()

# Show plot
plt.grid(True)
plt.show()

In the competitive world of basketball, understanding the factors that contribute to a team’s success is crucial. One such factor that often comes under scrutiny is the efficiency rating of players. The player efficiency rating is a comprehensive metric that encapsulates a player’s contributions to the game, both positive and negative. In this report, we will delve into an analysis of how the player efficiency rating impacts a team’s wins. We will test the hypothesis that the player efficiency rating has a significant effect on a team’s wins. Our analysis will be based on data from various teams and players, and we will use statistical methods to draw meaningful conclusions. Let’s dive into the details.

The Efficiency Rating is a comprehensive metric that attempts to measure a player’s productivity or efficiency in the game. Here’s how it works:

PTS: Points scored by the opponent team. ORB: Offensive rebounds by the opponent team. DRB: Defensive rebounds by the opponent team. STL: Steals by the opponent team. BLK: Blocks by the opponent team. FGmissed: Field goals missed by the opponent team. FTmissed: Free throws missed by the opponent team. TO: Turnovers by the opponent team.

The Efficiency Rating is calculated using the formula: Efficiency Rating=PTS+3∗ORB+2∗DRB+STL+1.5∗BLK−FGmissed−FTmissed−TO​ / MINUTES_PLAYED This formula takes into account both the positive contributions (points, rebounds, steals, blocks) and the negative contributions (missed shots, turnovers) of the opponent team. The result is then normalized by the total minutes played. This metric can be very helpful in analyzing the performance of the opponent teams. A higher Efficiency Rating indicates a more productive or efficient performance. By comparing the Efficiency Ratings of different teams, you can gain insights into their relative strengths and weaknesses. This can be particularly useful in strategic planning and decision-making for future games. For example, if a team has a high Efficiency Rating, it might be beneficial to study their strategies and adapt your team’s defense accordingly. Similarly, if a team has a low Efficiency Rating, it might indicate areas where another team can take advantage.

In [137]:
player_df['Points'] = player_df['TOTAL_POINTS']
player_df['Offensive Rebounds'] = player_df['OFFENSIVE_REBOUND']
player_df['Defensive Rebounds'] = player_df['DEFENSIVE_REBOUND']
player_df['Steals'] = player_df['STEALS']
player_df['Blocks'] = player_df['BLOCK']
player_df['Field Goals Missed'] = player_df['FIELD_GOAL_ATTEMPTED'] - player_df['FIELD_GOALS_MADE']
player_df['Free Throws Missed'] = player_df['FREE_THROW_ATTEMPTED'] - player_df['FREE_THROWS_MADE']
player_df['Turnovers'] = player_df['TURNOVER']

player_df['Player Efficiency Rating'] = (player_df['Points'] + 3 * player_df['Offensive Rebounds'] + 2 * player_df['Defensive Rebounds'] + player_df['Steals'] + 1.5 * player_df['Blocks'] - player_df['Field Goals Missed'] - player_df['Free Throws Missed'] - player_df['Turnovers']) / player_df['MINUTES_PLAYED']

print(player_df['Player Efficiency Rating'])
0        0.510448
1        0.563043
2        0.238636
3        0.763344
4        0.446878
           ...   
31782         NaN
31783         NaN
31784         NaN
31785         NaN
31786         NaN
Name: Player Efficiency Rating, Length: 31787, dtype: float64
In [138]:
player_df['Player Efficiency Rating'].isna().sum()
Out[138]:
6681
In [139]:
player_df.columns.to_list()
Out[139]:
['SEASON',
 'PLAYER_ID',
 'PLAYER',
 'AGE',
 'EXPERIENCE',
 'TEAM_ABBREVIATION',
 'GAMES',
 'GAMES_STARTED',
 'MINUTES_PLAYED',
 'FIELD_GOALS_MADE',
 'FIELD_GOAL_ATTEMPTED',
 'FIELD_GOAL_PERCENTAGE',
 'THREE_POINTERS_MADE',
 'THREE_POINTERS_PERCENTAGE',
 'TWO_POINTERS_MADE',
 'TWO_POINTERS_PERCENTAGE',
 'EFFECTIVE_FIELD_GOAL_PERCENTAGE',
 'FREE_THROWS_MADE',
 'FREE_THROW_ATTEMPTED',
 'FREE_THROW_PERCENTAGE',
 'OFFENSIVE_REBOUND',
 'DEFENSIVE_REBOUND',
 'TOTAL_REBOUND',
 'ASSISTS',
 'STEALS',
 'BLOCK',
 'TURNOVER',
 'PERSONAL_FOUL',
 'TOTAL_POINTS',
 'TEAM',
 'OFFENSIVE_RATING',
 'DEFENSIVE RATING',
 'Points',
 'Offensive Rebounds',
 'Defensive Rebounds',
 'Steals',
 'Blocks',
 'Field Goals Missed',
 'Free Throws Missed',
 'Turnovers',
 'Player Efficiency Rating']
In [140]:
player_df = player_df.dropna(subset=['Player Efficiency Rating'])
In [141]:
player_df['Player Efficiency Rating'].isna().sum()
Out[141]:
0

Calculating the efficiency rating of all opponents against each team

In [142]:
opponent_team_df['PTS'] = opponent_team_df['OPPONENTS_POINTS']
opponent_team_df['ORB'] = opponent_team_df['OPPONENTS_OFFENSIVE_REBOUND']
opponent_team_df['DRB'] = opponent_team_df['OPPONENTS_DEFENSIVE_REBOUND']
opponent_team_df['STL'] = opponent_team_df['OPPONENTS_STEALS']
opponent_team_df['BLK'] = opponent_team_df['OPPONENTS_BLOCKS']
opponent_team_df['FGmissed'] = opponent_team_df['OPPONENT_FIELD_GOALS_ATTEMPTED'] - opponent_team_df['OPPONENT_FIELD_GOALS']
opponent_team_df['FTmissed'] = opponent_team_df['OPPONENT_FREE_THROWS_ATTEMPTED'] - opponent_team_df['OPPONENT_FREE_THROWS']
opponent_team_df['TO'] = opponent_team_df['OPPONENTS_TURNOVERS']

opponent_team_df['Efficiency Rating'] = (opponent_team_df['PTS'] + 3*opponent_team_df['ORB'] + 2*opponent_team_df['DRB'] + opponent_team_df['STL'] + 1.5*opponent_team_df['BLK'] - opponent_team_df['FGmissed'] - opponent_team_df['FTmissed'] - opponent_team_df['TO']) / opponent_team_df['MINUTES_PLAYED']

print(opponent_team_df['Efficiency Rating'])
0       0.727857
1       0.637111
2       0.686958
3       0.652687
4       0.714028
          ...   
1812    0.583080
1813    0.766626
1814    0.666135
1815    0.656155
1816    0.698643
Name: Efficiency Rating, Length: 1817, dtype: float64
In [143]:
print(opponent_team_df['Efficiency Rating'].isna().sum())
0

Ho: The player efficiency rating has no effect on a team's wins

Ha: The player efficiency rating has an effect on a team's wins

In [144]:
# Merge games_df and player_df on 'TEAM_NAME' instead of 'TEAM'
merged_df = pd.merge(merged_team_df, player_df, on='TEAM')

average_efficiency = merged_df.groupby('TEAM')['Player Efficiency Rating'].mean()

winning_teams = average_efficiency[merged_team_df['W'] == 1]
losing_teams = average_efficiency[merged_team_df['L'] == 0]

t_stat, p_value = stats.ttest_ind(winning_teams, losing_teams)
In [145]:
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis and conclude that the player efficiency rating does have an effect on a team's wins.")
else:
    print("We fail to reject the null hypothesis and conclude that the player efficiency rating does not have an effect on a team's wins.")
We fail to reject the null hypothesis and conclude that the player efficiency rating does not have an effect on a team's wins.

Insights:¶

Based on the Ordinary Least Squares (OLS) regression results and hypothesis testing, we can conclude that the player efficiency rating significantly impacts a team’s wins. The model, with an R-squared value of 0.850, explains a substantial portion of the variance in wins. The coefficient of the Composite Score is 1.3485, indicating a positive relationship with wins. However, the p-value from the hypothesis test, although greater than alpha, is not convincingly significant, suggesting the need for further investigation. This analysis provides a solid foundation for understanding the dynamics of team wins in NBA games, but additional research is necessary to confirm these findings.

V - Primary Analysis¶

In the following part of this project we will walk through the comprehensive process of our data analysis and machine learning model implementation. We start by preprocessing our data, which includes reading the data, calculating unique values, converting to lowercase for consistency,merging datasets, and creating new informative columns. We then move on to calculation of information gain and appliication of label encoding to categorical column. Missing values are handled appropriately to ensure the integrity of our dataset. We also perform feature selection leading to columns being dropped. After preparing our data we split it into training and testing stes and apply a Random Forest Classifier to make predictions. We evaluate our model using accuracy, confusion matrix, cross-validation scores, and generate a classification report. Furthermore, we visualize the feature importance to understand which features are driving the predictions of our model. Lastly, we introduce regularization to out model to prevent overfitting and improve its generalizability.

It should be noted that we decided not to consider "Player Efficiency Rating as we realized that while its p-value from our hypothesis testing above was greater than the alpha value, it wasn't convincing enough to be used in our final calculations for the model. We thus decided to work solely with the Team dataframe "merged_team_df" as we found that our composite score which has a pretty high information gain value (as you will se below), makes use of columns from the team dataframe. It is also worth noting that the R-squared value we got from the linear Regression using composite score is pretty high(0.85)

Let's delve into it!

In [146]:
championsdf = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/Champions and Year.csv")
<ipython-input-146-4901856a9cf1>:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  championsdf = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/Champions and Year.csv")
In [147]:
championsdf.head()
Out[147]:
YEAR CHAMPIONSHIP_TEAM
0 2023.0 DENVER NUGGETS
1 2022.0 GOLDEN STATE WARRIORS
2 2021.0 MILWAUKEE BUCKS
3 2020.0 LOS ANGELES LAKERS
4 2019.0 TORONTO RAPTORS
In [148]:
# Assuming df is your DataFrame and 'column_name' is the column you're interested in
num_unique_values = championsdf['YEAR'].nunique()

print(f"The number of unique years in the column is: {num_unique_values}")
The number of unique years in the column is: 77
In [149]:
# Convert the 'team' and 'CHAMPIONSHIP_TEAM' columns to lower case
championsdf['CHAMPIONSHIP_TEAM'] = championsdf['CHAMPIONSHIP_TEAM'].str.lower()
merged_team_df['TEAM'] = merged_team_df['TEAM'].str.lower()

# Merge the dataframes
final_team_df = pd.merge(merged_team_df, championsdf, how='left', left_on=['SEASON', 'TEAM'], right_on=['YEAR', 'CHAMPIONSHIP_TEAM'])

# Create a new column 'champion'
final_team_df['CHAMPION'] = final_team_df['YEAR'].apply(lambda x: 'yes' if pd.notnull(x) else 'no')
In [150]:
num_unique_values = final_team_df['SEASON'].nunique()

print(f"The number of seasons values in the dataframe: {num_unique_values}")
The number of seasons values in the dataframe: 78
In [151]:
count_of_value = final_team_df['CHAMPION'].value_counts()['yes']

print(f"The count of 'yes' is: {count_of_value}")
The count of 'yes' is: 75
In [152]:
final_team_df.head(50)
Out[152]:
SEASON LEAGUE_x TEAM ABBREVIATION_x PLAYOFFS_x GAMES_x MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_ATTEMPTED FIELD_GOALS_PERCENTAGE ... TWO_POINTER_PERCENTAGE DEFENSIVE_REBOUND_PER_GAME ASSIST_PER_GAME POINTS_PER_GAME W L Composite Score YEAR CHAMPIONSHIP_TEAM CHAMPION
0 2024 NBA atlanta hawks ATL False 55.0 13300.0 2397.0 5144.0 0.466 ... 0.539 31.9 26.2 121.3 24.0 31.0 -7.188061 NaN NaN no
1 2024 NBA boston celtics BOS False 55.0 13325.0 2391.0 4967.0 0.481 ... 0.572 36.6 26.2 120.7 43.0 12.0 9.512361 NaN NaN no
2 2024 NBA brooklyn nets BRK False 54.0 13035.0 2257.0 4902.0 0.460 ... 0.525 32.8 26.8 113.4 21.0 33.0 -9.151929 NaN NaN no
3 2024 NBA chicago bulls CHI False 55.0 13400.0 2292.0 4904.0 0.467 ... 0.529 32.4 24.3 111.8 26.0 29.0 -6.050632 NaN NaN no
4 2024 NBA charlotte hornets CHO False 54.0 13010.0 2206.0 4790.0 0.461 ... 0.519 31.0 24.9 108.6 13.0 41.0 -19.029235 NaN NaN no
5 2024 NBA cleveland cavaliers CLE False 53.0 12770.0 2267.0 4698.0 0.483 ... 0.570 34.2 27.3 114.9 36.0 17.0 2.834733 NaN NaN no
6 2024 NBA dallas mavericks DAL False 55.0 13200.0 2362.0 4937.0 0.478 ... 0.565 32.3 25.6 118.7 32.0 23.0 -2.512589 NaN NaN no
7 2024 NBA denver nuggets DEN False 55.0 13200.0 2388.0 4883.0 0.489 ... 0.554 33.0 28.6 113.9 36.0 19.0 -0.040405 NaN NaN no
8 2024 NBA detroit pistons DET False 54.0 13035.0 2281.0 4836.0 0.472 ... 0.531 32.4 26.5 112.9 8.0 46.0 -16.458252 NaN NaN no
9 2024 NBA golden state warriors GSW False 53.0 12870.0 2311.0 4871.0 0.474 ... 0.546 34.3 29.0 119.7 27.0 26.0 -2.811058 NaN NaN no
10 2024 NBA houston rockets HOU False 54.0 13085.0 2244.0 4861.0 0.462 ... 0.529 34.7 24.8 113.3 24.0 30.0 -6.024857 NaN NaN no
11 2024 NBA indiana pacers IND False 56.0 13440.0 2622.0 5163.0 0.508 ... 0.588 30.6 30.9 123.7 31.0 25.0 0.622463 NaN NaN no
12 2024 NBA los angeles clippers LAC False 53.0 12745.0 2281.0 4622.0 0.494 ... 0.555 32.8 26.0 118.3 36.0 17.0 3.685619 NaN NaN no
13 2024 NBA los angeles lakers LAL False 56.0 13540.0 2421.0 4900.0 0.494 ... 0.561 34.5 28.5 117.0 30.0 26.0 -1.645627 NaN NaN no
14 2024 NBA memphis grizzlies MEM False 56.0 13490.0 2174.0 4929.0 0.441 ... 0.514 31.8 25.4 107.1 20.0 36.0 -15.122968 NaN NaN no
15 2024 NBA miami heat MIA False 55.0 13225.0 2183.0 4719.0 0.463 ... 0.521 32.6 25.9 110.6 30.0 25.0 -4.973745 NaN NaN no
16 2024 NBA milwaukee bucks MIL False 56.0 13515.0 2468.0 4998.0 0.494 ... 0.582 34.7 26.7 122.0 35.0 21.0 3.855911 NaN NaN no
17 2024 NBA minnesota timberwolves MIN False 55.0 13275.0 2275.0 4639.0 0.490 ... 0.550 34.7 26.4 114.0 39.0 16.0 4.795969 NaN NaN no
18 2024 NBA new orleans pelicans NOP False 55.0 13225.0 2360.0 4829.0 0.489 ... 0.553 33.3 27.1 116.5 33.0 22.0 0.996059 NaN NaN no
19 2024 NBA new york knicks NYK False 55.0 13200.0 2292.0 4906.0 0.467 ... 0.532 33.1 23.7 114.5 33.0 22.0 -0.595368 NaN NaN no
20 2024 NBA oklahoma city thunder OKC False 54.0 13060.0 2405.0 4808.0 0.500 ... 0.566 32.5 27.3 120.8 37.0 17.0 4.547477 NaN NaN no
21 2024 NBA orlando magic ORL False 55.0 13300.0 2240.0 4734.0 0.473 ... 0.544 31.9 24.9 111.7 30.0 25.0 -5.589931 NaN NaN no
22 2024 NBA philadelphia 76ers PHI False 54.0 13010.0 2296.0 4838.0 0.475 ... 0.537 32.1 24.9 118.3 32.0 22.0 1.385087 NaN NaN no
23 2024 NBA phoenix suns PHO False 55.0 13275.0 2347.0 4707.0 0.499 ... 0.568 33.5 26.8 117.6 33.0 22.0 3.284171 NaN NaN no
24 2024 NBA portland trail blazers POR False 54.0 13110.0 2119.0 4820.0 0.440 ... 0.491 29.5 22.6 107.9 15.0 39.0 -20.070180 NaN NaN no
25 2024 NBA sacramento kings SAC False 54.0 13085.0 2368.0 4908.0 0.482 ... 0.571 33.0 28.6 118.4 31.0 23.0 -3.691675 NaN NaN no
26 2024 NBA san antonio spurs SAS False 55.0 13250.0 2302.0 5012.0 0.459 ... 0.539 33.0 29.3 111.8 11.0 44.0 -15.600723 NaN NaN no
27 2024 NBA toronto raptors TOR False 55.0 13300.0 2354.0 4943.0 0.476 ... 0.545 32.2 29.2 113.9 19.0 36.0 -8.751400 NaN NaN no
28 2024 NBA utah jazz UTA False 56.0 13565.0 2398.0 5091.0 0.471 ... 0.549 33.8 28.4 118.0 26.0 30.0 -6.301609 NaN NaN no
29 2024 NBA washington wizards WAS False 54.0 12960.0 2338.0 4930.0 0.474 ... 0.550 31.5 28.1 114.8 9.0 45.0 -14.708072 NaN NaN no
30 2023 NBA atlanta hawks ATL False 82.0 19855.0 3658.0 7574.0 0.483 ... 0.548 33.2 25.0 118.4 41.0 41.0 5.039652 NaN NaN no
31 2023 NBA boston celtics BOS False 82.0 19980.0 3460.0 7278.0 0.475 ... 0.567 35.6 26.7 117.9 57.0 25.0 14.503575 NaN NaN no
32 2023 NBA brooklyn nets BRK False 82.0 19730.0 3399.0 6978.0 0.487 ... 0.559 32.3 25.5 113.4 45.0 37.0 6.229818 NaN NaN no
33 2023 NBA chicago bulls CHI False 82.0 19905.0 3488.0 7116.0 0.490 ... 0.555 33.9 24.5 113.1 40.0 42.0 6.265528 NaN NaN no
34 2023 NBA charlotte hornets CHO False 82.0 19830.0 3385.0 7413.0 0.457 ... 0.528 33.5 25.1 111.0 27.0 55.0 -5.664596 NaN NaN no
35 2023 NBA cleveland cavaliers CLE False 82.0 19880.0 3408.0 6984.0 0.488 ... 0.559 31.4 24.9 112.3 51.0 31.0 10.240028 NaN NaN no
36 2023 NBA dallas mavericks DAL False 82.0 19930.0 3283.0 6909.0 0.475 ... 0.574 31.2 22.9 114.2 38.0 44.0 3.161333 NaN NaN no
37 2023 NBA denver nuggets DEN False 82.0 19755.0 3574.0 7088.0 0.504 ... 0.575 32.9 28.9 115.8 53.0 29.0 11.046367 2023.0 denver nuggets yes
38 2023 NBA detroit pistons DET False 82.0 19805.0 3244.0 7140.0 0.454 ... 0.516 31.3 23.0 110.3 17.0 65.0 -10.766249 NaN NaN no
39 2023 NBA golden state warriors GSW False 82.0 19830.0 3538.0 7393.0 0.479 ... 0.564 34.1 29.8 118.9 44.0 38.0 8.250557 NaN NaN no
40 2023 NBA houston rockets HOU False 82.0 19755.0 3329.0 7286.0 0.457 ... 0.530 32.9 22.4 110.7 22.0 60.0 -9.554702 NaN NaN no
41 2023 NBA indiana pacers IND False 82.0 19755.0 3444.0 7345.0 0.469 ... 0.540 31.4 27.0 116.3 35.0 47.0 -0.748703 NaN NaN no
42 2023 NBA los angeles clippers LAC False 82.0 19830.0 3370.0 7059.0 0.477 ... 0.539 33.4 23.9 113.6 44.0 38.0 5.304915 NaN NaN no
43 2023 NBA los angeles lakers LAL False 82.0 19880.0 3516.0 7298.0 0.482 ... 0.555 35.7 25.3 117.2 43.0 39.0 7.100823 NaN NaN no
44 2023 NBA memphis grizzlies MEM False 82.0 19780.0 3585.0 7551.0 0.475 ... 0.548 34.6 26.0 116.9 51.0 31.0 9.904935 NaN NaN no
45 2023 NBA miami heat MIA False 82.0 19805.0 3215.0 6991.0 0.460 ... 0.540 30.9 23.8 109.5 44.0 38.0 1.289634 NaN NaN no
46 2023 NBA milwaukee bucks MIL False 82.0 19830.0 3504.0 7411.0 0.473 ... 0.557 37.5 25.8 116.9 58.0 24.0 11.618075 NaN NaN no
47 2023 NBA minnesota timberwolves MIN False 82.0 19830.0 3515.0 7167.0 0.490 ... 0.568 32.8 26.2 115.8 42.0 40.0 5.278023 NaN NaN no
48 2023 NBA new orleans pelicans NOP False 82.0 19855.0 3447.0 7180.0 0.480 ... 0.541 33.1 25.9 114.4 42.0 40.0 6.503432 NaN NaN no
49 2023 NBA new york knicks NYK False 82.0 19955.0 3444.0 7328.0 0.470 ... 0.547 34.0 22.9 116.0 47.0 35.0 6.553436 NaN NaN no

50 rows × 53 columns

In [153]:
final_team_df.columns.tolist()
Out[153]:
['SEASON',
 'LEAGUE_x',
 'TEAM',
 'ABBREVIATION_x',
 'PLAYOFFS_x',
 'GAMES_x',
 'MINUTES_PLAYED',
 'FIELD_GOALS_MADE',
 'FIELD_GOALS_ATTEMPTED',
 'FIELD_GOALS_PERCENTAGE',
 'TWO_POINTERS_MADE',
 'TWO_POINTERS_PERCENTAGE',
 'FREE_THROWS_MADE',
 'FREE_THROWS_ATTEMPTED',
 'FREE_THROW_PERCENTAGE_x',
 'DEFENSIVE_REBOUND',
 'TOTAL_REBOUNDS',
 'ASSISTS',
 'STEALS',
 'BLOCKS',
 'TURNOVERS',
 'POINTS',
 'L_x',
 'PLAYOFFS_y',
 'AGE',
 'PW',
 'PL',
 'MARGIN_OF_VICTORY',
 'STRENGTH_OF_SCHEDULE',
 'SIMPLE_RATING_SYSTEM',
 'OFFENSIVE_RATING',
 'DEFENSIVE_RATING',
 'NET_RATING',
 'TRUE_SHOOTING_PERCENTAGE',
 'EFFECTIEV_FIELD_GOAL_PERCENTAGE',
 'TURNOVER_PERCENTAGE',
 'OPPONENT_EFFECTIVE_FIELD_GAL',
 'OPPONENT_FREE_THROW_PERCENTAGE.1',
 'LEAGUE',
 'ABBREVIATION',
 'PLAYOFFS',
 'fg_per_game',
 'FIELD_GOAL_PERCENTAGE',
 'TWO_POINTER_PERCENTAGE',
 'DEFENSIVE_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'POINTS_PER_GAME',
 'W',
 'L',
 'Composite Score',
 'YEAR',
 'CHAMPIONSHIP_TEAM',
 'CHAMPION']

Dropping some columns that might not be relevant

In [154]:
final_team_df.drop('LEAGUE', axis=1, inplace=True)
final_team_df.drop('LEAGUE_x', axis=1, inplace=True)
final_team_df.drop('ABBREVIATION', axis=1, inplace=True)
final_team_df.drop('ABBREVIATION_x', axis=1, inplace=True)
final_team_df.drop('PLAYOFFS_x', axis=1, inplace=True)
final_team_df.drop('PLAYOFFS_y', axis=1, inplace=True)
In [155]:
final_team_df.dtypes
Out[155]:
SEASON                                int64
TEAM                                 object
GAMES_x                             float64
MINUTES_PLAYED                      float64
FIELD_GOALS_MADE                    float64
FIELD_GOALS_ATTEMPTED               float64
FIELD_GOALS_PERCENTAGE              float64
TWO_POINTERS_MADE                   float64
TWO_POINTERS_PERCENTAGE             float64
FREE_THROWS_MADE                    float64
FREE_THROWS_ATTEMPTED               float64
FREE_THROW_PERCENTAGE_x             float64
DEFENSIVE_REBOUND                   float64
TOTAL_REBOUNDS                      float64
ASSISTS                             float64
STEALS                              float64
BLOCKS                              float64
TURNOVERS                           float64
POINTS                              float64
L_x                                 float64
AGE                                 float64
PW                                  float64
PL                                  float64
MARGIN_OF_VICTORY                   float64
STRENGTH_OF_SCHEDULE                float64
SIMPLE_RATING_SYSTEM                float64
OFFENSIVE_RATING                    float64
DEFENSIVE_RATING                    float64
NET_RATING                          float64
TRUE_SHOOTING_PERCENTAGE            float64
EFFECTIEV_FIELD_GOAL_PERCENTAGE     float64
TURNOVER_PERCENTAGE                 float64
OPPONENT_EFFECTIVE_FIELD_GAL        float64
OPPONENT_FREE_THROW_PERCENTAGE.1    float64
PLAYOFFS                             object
fg_per_game                         float64
FIELD_GOAL_PERCENTAGE               float64
TWO_POINTER_PERCENTAGE              float64
DEFENSIVE_REBOUND_PER_GAME          float64
ASSIST_PER_GAME                     float64
POINTS_PER_GAME                     float64
W                                   float64
L                                   float64
Composite Score                     float64
YEAR                                float64
CHAMPIONSHIP_TEAM                    object
CHAMPION                             object
dtype: object

We will be applying One Hot Encoding to categorical data to make sure our dataframe is mainly numeric.

In [156]:
encoded_final_team_df = pd.get_dummies(final_team_df, columns=['PLAYOFFS', 'CHAMPION'], drop_first= True)
encoded_final_team_df.rename(columns={'PLAYOFFS_True': 'PLAYOFFS', 'CHAMPION_yes' : 'CHAMPION'}, inplace=True)
# Convert boolean columns to integer (0 and 1)
encoded_final_team_df['PLAYOFFS'] = encoded_final_team_df['PLAYOFFS'].astype(int)
encoded_final_team_df['CHAMPION'] = encoded_final_team_df['CHAMPION'].astype(int)

encoded_final_team_df
Out[156]:
SEASON TEAM GAMES_x MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_ATTEMPTED FIELD_GOALS_PERCENTAGE TWO_POINTERS_MADE TWO_POINTERS_PERCENTAGE FREE_THROWS_MADE ... DEFENSIVE_REBOUND_PER_GAME ASSIST_PER_GAME POINTS_PER_GAME W L Composite Score YEAR CHAMPIONSHIP_TEAM PLAYOFFS CHAMPION
0 2024 atlanta hawks 55.0 13300.000000 2397.0 5144.0 0.466 1641.0 0.539 1123.0 ... 31.900000 26.2 121.3 24.0 31.0 -7.188061 NaN NaN 0 0
1 2024 boston celtics 55.0 13325.000000 2391.0 4967.0 0.481 1498.0 0.572 962.0 ... 36.600000 26.2 120.7 43.0 12.0 9.512361 NaN NaN 0 0
2 2024 brooklyn nets 54.0 13035.000000 2257.0 4902.0 0.460 1510.0 0.525 860.0 ... 32.800000 26.8 113.4 21.0 33.0 -9.151929 NaN NaN 0 0
3 2024 chicago bulls 55.0 13400.000000 2292.0 4904.0 0.467 1644.0 0.529 917.0 ... 32.400000 24.3 111.8 26.0 29.0 -6.050632 NaN NaN 0 0
4 2024 charlotte hornets 54.0 13010.000000 2206.0 4790.0 0.461 1559.0 0.519 808.0 ... 31.000000 24.9 108.6 13.0 41.0 -19.029235 NaN NaN 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1600 1948 new york knicks 48.0 19400.166667 1355.0 4724.0 0.287 1355.0 0.287 868.0 ... 30.509804 7.8 74.5 26.0 22.0 -21.047678 NaN NaN 1 0
1601 1948 washington capitols 48.0 20235.000000 1336.0 4785.0 0.279 1336.0 0.279 865.0 ... 36.900000 6.4 73.7 28.0 20.0 -23.873765 NaN NaN 1 0
1602 1947 boston celtics 60.0 19403.083333 1397.0 5133.0 0.272 1397.0 0.272 811.0 ... 31.650980 7.8 60.1 22.0 38.0 -29.726828 NaN NaN 0 0
1603 1947 new york knicks 60.0 19400.166667 1465.0 5255.0 0.279 1465.0 0.279 951.0 ... 30.509804 7.6 64.7 33.0 27.0 -23.842757 NaN NaN 1 0
1604 1947 washington capitols 60.0 20235.000000 1723.0 5794.0 0.297 1723.0 0.297 982.0 ... 36.900000 6.3 73.8 49.0 11.0 -13.805353 NaN NaN 1 0

1605 rows × 47 columns

We removed all rows for 2024 as taht were concerned with data from 2024 as this is the year we will be predicting at the end of this project.

In [157]:
encoded_final_team_df = encoded_final_team_df[encoded_final_team_df['SEASON'] != 2024]
In [158]:
encoded_final_team_df.columns.tolist()
Out[158]:
['SEASON',
 'TEAM',
 'GAMES_x',
 'MINUTES_PLAYED',
 'FIELD_GOALS_MADE',
 'FIELD_GOALS_ATTEMPTED',
 'FIELD_GOALS_PERCENTAGE',
 'TWO_POINTERS_MADE',
 'TWO_POINTERS_PERCENTAGE',
 'FREE_THROWS_MADE',
 'FREE_THROWS_ATTEMPTED',
 'FREE_THROW_PERCENTAGE_x',
 'DEFENSIVE_REBOUND',
 'TOTAL_REBOUNDS',
 'ASSISTS',
 'STEALS',
 'BLOCKS',
 'TURNOVERS',
 'POINTS',
 'L_x',
 'AGE',
 'PW',
 'PL',
 'MARGIN_OF_VICTORY',
 'STRENGTH_OF_SCHEDULE',
 'SIMPLE_RATING_SYSTEM',
 'OFFENSIVE_RATING',
 'DEFENSIVE_RATING',
 'NET_RATING',
 'TRUE_SHOOTING_PERCENTAGE',
 'EFFECTIEV_FIELD_GOAL_PERCENTAGE',
 'TURNOVER_PERCENTAGE',
 'OPPONENT_EFFECTIVE_FIELD_GAL',
 'OPPONENT_FREE_THROW_PERCENTAGE.1',
 'fg_per_game',
 'FIELD_GOAL_PERCENTAGE',
 'TWO_POINTER_PERCENTAGE',
 'DEFENSIVE_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'POINTS_PER_GAME',
 'W',
 'L',
 'Composite Score',
 'YEAR',
 'CHAMPIONSHIP_TEAM',
 'PLAYOFFS',
 'CHAMPION']

Calculating Information Gain to find which values are more likely to be related to a team's win of the championship.

In [159]:
def information_gain(data, feature, target):
    # Calculate the entropy of the data before the split
    entropy_before = entropy(data[target])

    # Calculate the entropy of the data after the split
    grouped_entropy = data.groupby(feature)[target].apply(entropy)
    entropy_after = (data.groupby(feature).size() / len(data)) @ grouped_entropy

    # Calculate the information gain
    information_gain = entropy_before - entropy_after
    return information_gain

def entropy(data):
    # Calculate the probability of each class
    probabilities = data.value_counts(normalize=True)

    # Calculate the entropy
    entropy = - (probabilities * probabilities.apply(math.log2)).sum()
    return entropy

# Example DataFrame
encoded_final_team_df = pd.DataFrame(encoded_final_team_df)  # Your DataFrame here

# Calculate the information gain for each column against 'CHAMPION'
target_feature = 'CHAMPION'
information_gains = {}
for column in encoded_final_team_df.columns:
    if column != target_feature:  # Skip the target feature itself
        information_gains[column] = information_gain(encoded_final_team_df, column, target_feature)

# Print the information gains
for column, information_gain in information_gains.items():
    print(f"Information gain for '{column}' against '{target_feature}': {information_gain}")
Information gain for 'SEASON' against 'CHAMPION': 0.014950515682800036
Information gain for 'TEAM' against 'CHAMPION': 0.05767933757707558
Information gain for 'GAMES_x' against 'CHAMPION': 0.011448743875787482
Information gain for 'MINUTES_PLAYED' against 'CHAMPION': 0.047207293700936154
Information gain for 'FIELD_GOALS_MADE' against 'CHAMPION': 0.20752560039814788
Information gain for 'FIELD_GOALS_ATTEMPTED' against 'CHAMPION': 0.23226885809553627
Information gain for 'FIELD_GOALS_PERCENTAGE' against 'CHAMPION': 0.08268211670982178
Information gain for 'TWO_POINTERS_MADE' against 'CHAMPION': 0.2212974000627263
Information gain for 'TWO_POINTERS_PERCENTAGE' against 'CHAMPION': 0.08535481706440465
Information gain for 'FREE_THROWS_MADE' against 'CHAMPION': 0.18479791598779124
Information gain for 'FREE_THROWS_ATTEMPTED' against 'CHAMPION': 0.20856399533363978
Information gain for 'FREE_THROW_PERCENTAGE_x' against 'CHAMPION': 0.05697120368231767
Information gain for 'DEFENSIVE_REBOUND' against 'CHAMPION': 0.16205761003546354
Information gain for 'TOTAL_REBOUNDS' against 'CHAMPION': 0.21047493558169977
Information gain for 'ASSISTS' against 'CHAMPION': 0.18527630668706338
Information gain for 'STEALS' against 'CHAMPION': 0.13934322636949498
Information gain for 'BLOCKS' against 'CHAMPION': 0.12884232380171226
Information gain for 'TURNOVERS' against 'CHAMPION': 0.16491837011046245
Information gain for 'POINTS' against 'CHAMPION': 0.22373244956343707
Information gain for 'L_x' against 'CHAMPION': 0.10928455726939384
Information gain for 'AGE' against 'CHAMPION': 0.06424082812056756
Information gain for 'PW' against 'CHAMPION': 0.0962447338650389
Information gain for 'PL' against 'CHAMPION': 0.09290193883833536
Information gain for 'MARGIN_OF_VICTORY' against 'CHAMPION': 0.2476115863726033
Information gain for 'STRENGTH_OF_SCHEDULE' against 'CHAMPION': 0.09360754023198553
Information gain for 'SIMPLE_RATING_SYSTEM' against 'CHAMPION': 0.22808467367281696
Information gain for 'OFFENSIVE_RATING' against 'CHAMPION': 0.11167737066739572
Information gain for 'DEFENSIVE_RATING' against 'CHAMPION': 0.13150596965091188
Information gain for 'NET_RATING' against 'CHAMPION': 0.13423877868743925
Information gain for 'TRUE_SHOOTING_PERCENTAGE' against 'CHAMPION': 0.07878275069041518
Information gain for 'EFFECTIEV_FIELD_GOAL_PERCENTAGE' against 'CHAMPION': 0.08628835913075475
Information gain for 'TURNOVER_PERCENTAGE' against 'CHAMPION': 0.049494424878997134
Information gain for 'OPPONENT_EFFECTIVE_FIELD_GAL' against 'CHAMPION': 0.09005161487932403
Information gain for 'OPPONENT_FREE_THROW_PERCENTAGE.1' against 'CHAMPION': 0.07570697757327138
Information gain for 'fg_per_game' against 'CHAMPION': 0.09389155654177977
Information gain for 'FIELD_GOAL_PERCENTAGE' against 'CHAMPION': 0.08268211670982178
Information gain for 'TWO_POINTER_PERCENTAGE' against 'CHAMPION': 0.08535481706440465
Information gain for 'DEFENSIVE_REBOUND_PER_GAME' against 'CHAMPION': 0.07873905171513682
Information gain for 'ASSIST_PER_GAME' against 'CHAMPION': 0.07003492190767108
Information gain for 'POINTS_PER_GAME' against 'CHAMPION': 0.11811261893620706
Information gain for 'W' against 'CHAMPION': 0.11243597237104624
Information gain for 'L' against 'CHAMPION': 0.10928455726939384
Information gain for 'Composite Score' against 'CHAMPION': 0.2761954276479391
Information gain for 'YEAR' against 'CHAMPION': 0.2761954276479391
Information gain for 'CHAMPIONSHIP_TEAM' against 'CHAMPION': 0.2761954276479391
Information gain for 'PLAYOFFS' against 'CHAMPION': 0.03535872083113947

By scaling the information gain values, we mitigate the potential dominance of features with larger scales, allowing for a more balanced evaluation of each feature's importance in predicting championship outcomes. This normalization process enhances the robustness and interpretability of our model, enabling us to make informed decisions about which features to include in our regression analysis for more reliable championship predictions.

In [160]:
encoded_final_team_df.head(50)
Out[160]:
SEASON TEAM GAMES_x MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_ATTEMPTED FIELD_GOALS_PERCENTAGE TWO_POINTERS_MADE TWO_POINTERS_PERCENTAGE FREE_THROWS_MADE ... DEFENSIVE_REBOUND_PER_GAME ASSIST_PER_GAME POINTS_PER_GAME W L Composite Score YEAR CHAMPIONSHIP_TEAM PLAYOFFS CHAMPION
30 2023 atlanta hawks 82.0 19855.0 3658.0 7574.0 0.483 2776.0 0.548 1513.0 ... 33.2 25.0 118.4 41.0 41.0 5.039652 NaN NaN 0 0
31 2023 boston celtics 82.0 19980.0 3460.0 7278.0 0.475 2145.0 0.567 1436.0 ... 35.6 26.7 117.9 57.0 25.0 14.503575 NaN NaN 0 0
32 2023 brooklyn nets 82.0 19730.0 3399.0 6978.0 0.487 2351.0 0.559 1449.0 ... 32.3 25.5 113.4 45.0 37.0 6.229818 NaN NaN 0 0
33 2023 chicago bulls 82.0 19905.0 3488.0 7116.0 0.490 2634.0 0.555 1446.0 ... 33.9 24.5 113.1 40.0 42.0 6.265528 NaN NaN 0 0
34 2023 charlotte hornets 82.0 19830.0 3385.0 7413.0 0.457 2504.0 0.528 1447.0 ... 33.5 25.1 111.0 27.0 55.0 -5.664596 NaN NaN 0 0
35 2023 cleveland cavaliers 82.0 19880.0 3408.0 6984.0 0.488 2458.0 0.559 1439.0 ... 31.4 24.9 112.3 51.0 31.0 10.240028 NaN NaN 0 0
36 2023 dallas mavericks 82.0 19930.0 3283.0 6909.0 0.475 2037.0 0.574 1554.0 ... 31.2 22.9 114.2 38.0 44.0 3.161333 NaN NaN 0 0
37 2023 denver nuggets 82.0 19755.0 3574.0 7088.0 0.504 2605.0 0.575 1378.0 ... 32.9 28.9 115.8 53.0 29.0 11.046367 2023.0 denver nuggets 0 1
38 2023 detroit pistons 82.0 19805.0 3244.0 7140.0 0.454 2310.0 0.516 1623.0 ... 31.3 23.0 110.3 17.0 65.0 -10.766249 NaN NaN 0 0
39 2023 golden state warriors 82.0 19830.0 3538.0 7393.0 0.479 2175.0 0.564 1314.0 ... 34.1 29.8 118.9 44.0 38.0 8.250557 NaN NaN 0 0
40 2023 houston rockets 82.0 19755.0 3329.0 7286.0 0.457 2473.0 0.530 1567.0 ... 32.9 22.4 110.7 22.0 60.0 -9.554702 NaN NaN 0 0
41 2023 indiana pacers 82.0 19755.0 3444.0 7345.0 0.469 2332.0 0.540 1535.0 ... 31.4 27.0 116.3 35.0 47.0 -0.748703 NaN NaN 0 0
42 2023 los angeles clippers 82.0 19830.0 3370.0 7059.0 0.477 2329.0 0.539 1533.0 ... 33.4 23.9 113.6 44.0 38.0 5.304915 NaN NaN 0 0
43 2023 los angeles lakers 82.0 19880.0 3516.0 7298.0 0.482 2631.0 0.555 1691.0 ... 35.7 25.3 117.2 43.0 39.0 7.100823 NaN NaN 0 0
44 2023 memphis grizzlies 82.0 19780.0 3585.0 7551.0 0.475 2600.0 0.548 1432.0 ... 34.6 26.0 116.9 51.0 31.0 9.904935 NaN NaN 0 0
45 2023 miami heat 82.0 19805.0 3215.0 6991.0 0.460 2235.0 0.540 1567.0 ... 30.9 23.8 109.5 44.0 38.0 1.289634 NaN NaN 0 0
46 2023 milwaukee bucks 82.0 19830.0 3504.0 7411.0 0.473 2287.0 0.557 1364.0 ... 37.5 25.8 116.9 58.0 24.0 11.618075 NaN NaN 0 0
47 2023 minnesota timberwolves 82.0 19830.0 3515.0 7167.0 0.490 2518.0 0.568 1467.0 ... 32.8 26.2 115.8 42.0 40.0 5.278023 NaN NaN 0 0
48 2023 new orleans pelicans 82.0 19855.0 3447.0 7180.0 0.480 2548.0 0.541 1585.0 ... 33.1 25.9 114.4 42.0 40.0 6.503432 NaN NaN 0 0
49 2023 new york knicks 82.0 19955.0 3444.0 7328.0 0.470 2407.0 0.547 1589.0 ... 34.0 22.9 116.0 47.0 35.0 6.553436 NaN NaN 0 0
50 2023 oklahoma city thunder 82.0 19855.0 3533.0 7590.0 0.465 2538.0 0.530 1572.0 ... 32.3 24.4 117.5 40.0 42.0 3.638798 NaN NaN 0 0
51 2023 orlando magic 82.0 19780.0 3323.0 7074.0 0.470 2440.0 0.539 1607.0 ... 33.1 23.2 111.4 34.0 48.0 -1.575875 NaN NaN 0 0
52 2023 philadelphia 76ers 82.0 19880.0 3347.0 6870.0 0.487 2312.0 0.551 1719.0 ... 32.2 25.2 115.2 54.0 28.0 10.753173 NaN NaN 0 0
53 2023 phoenix suns 82.0 19780.0 3453.0 7388.0 0.467 2452.0 0.520 1412.0 ... 32.4 27.3 113.6 45.0 37.0 6.281789 NaN NaN 0 0
54 2023 portland trail blazers 82.0 19730.0 3317.0 7001.0 0.474 2261.0 0.550 1609.0 ... 31.1 24.2 113.4 33.0 49.0 -2.474210 NaN NaN 0 0
55 2023 sacramento kings 82.0 19830.0 3573.0 7232.0 0.494 2445.0 0.586 1624.0 ... 32.9 27.3 120.7 48.0 34.0 9.562584 NaN NaN 0 0
56 2023 san antonio spurs 82.0 19855.0 3533.0 7593.0 0.465 2622.0 0.529 1292.0 ... 31.9 27.2 113.0 22.0 60.0 -10.049878 NaN NaN 0 0
57 2023 toronto raptors 82.0 19805.0 3434.0 7489.0 0.459 2554.0 0.525 1506.0 ... 30.3 23.9 112.9 41.0 41.0 3.093160 NaN NaN 0 0
58 2023 utah jazz 82.0 19805.0 3485.0 7365.0 0.473 2391.0 0.560 1536.0 ... 34.1 26.0 117.1 37.0 45.0 3.838556 NaN NaN 0 0
59 2023 washington wizards 82.0 19755.0 3456.0 7126.0 0.485 2531.0 0.559 1442.0 ... 34.2 25.4 113.2 35.0 47.0 2.969104 NaN NaN 0 0
60 2022 atlanta hawks 82.0 19705.0 3401.0 7241.0 0.470 2345.0 0.531 1485.0 ... 33.9 24.6 113.9 43.0 39.0 5.432341 NaN NaN 1 0
61 2022 boston celtics 82.0 19905.0 3341.0 7167.0 0.466 2256.0 0.547 1397.0 ... 35.5 24.8 111.8 51.0 31.0 13.206362 NaN NaN 1 0
62 2022 brooklyn nets 82.0 19755.0 3442.0 7251.0 0.475 2502.0 0.538 1434.0 ... 34.1 25.3 112.9 44.0 38.0 6.103811 NaN NaN 1 0
63 2022 chicago bulls 82.0 19730.0 3422.0 7127.0 0.480 2550.0 0.535 1436.0 ... 33.7 23.9 111.6 46.0 36.0 3.285980 NaN NaN 1 0
64 2022 charlotte hornets 82.0 19880.0 3508.0 7497.0 0.468 2365.0 0.542 1298.0 ... 33.7 28.1 115.3 43.0 39.0 4.695518 NaN NaN 0 0
65 2022 cleveland cavaliers 82.0 19730.0 3255.0 6940.0 0.469 2302.0 0.541 1376.0 ... 34.0 25.2 107.8 44.0 38.0 5.081598 NaN NaN 0 0
66 2022 dallas mavericks 82.0 19755.0 3222.0 6982.0 0.461 2149.0 0.548 1341.0 ... 33.8 23.4 108.0 52.0 30.0 6.743320 NaN NaN 1 0
67 2022 denver nuggets 82.0 19805.0 3416.0 7079.0 0.483 2377.0 0.575 1372.0 ... 34.9 27.8 112.7 48.0 34.0 8.835371 NaN NaN 1 0
68 2022 detroit pistons 82.0 19780.0 3129.0 7267.0 0.431 2204.0 0.498 1413.0 ... 32.0 23.5 104.8 23.0 59.0 -11.534968 NaN NaN 0 0
69 2022 golden state warriors 82.0 19730.0 3323.0 7087.0 0.469 2147.0 0.557 1280.0 ... 35.7 27.1 111.0 53.0 29.0 11.684598 2022.0 golden state warriors 1 1
70 2022 houston rockets 82.0 19755.0 3229.0 7083.0 0.456 2124.0 0.543 1434.0 ... 32.4 23.6 109.7 20.0 62.0 -9.883424 NaN NaN 0 0
71 2022 indiana pacers 82.0 19880.0 3398.0 7338.0 0.463 2401.0 0.541 1347.0 ... 32.6 25.4 111.5 25.0 57.0 -2.462554 NaN NaN 0 0
72 2022 los angeles clippers 82.0 19780.0 3285.0 7170.0 0.458 2238.0 0.512 1273.0 ... 34.9 24.0 108.4 42.0 40.0 2.693561 NaN NaN 0 0
73 2022 los angeles lakers 82.0 19980.0 3415.0 7279.0 0.469 2433.0 0.546 1380.0 ... 34.5 24.0 112.1 33.0 49.0 0.602110 NaN NaN 0 0
74 2022 memphis grizzlies 82.0 19780.0 3571.0 7739.0 0.461 2626.0 0.519 1393.0 ... 35.0 26.0 115.6 56.0 26.0 11.182765 NaN NaN 1 0
75 2022 miami heat 82.0 19855.0 3246.0 6954.0 0.467 2132.0 0.531 1416.0 ... 33.9 25.5 110.0 53.0 29.0 9.139626 NaN NaN 1 0
76 2022 milwaukee bucks 82.0 19755.0 3429.0 7331.0 0.468 2276.0 0.544 1459.0 ... 36.5 23.9 115.5 51.0 31.0 9.273393 NaN NaN 1 0
77 2022 minnesota timberwolves 82.0 19780.0 3411.0 7458.0 0.457 2200.0 0.540 1474.0 ... 32.9 25.7 115.9 46.0 36.0 6.222438 NaN NaN 1 0
78 2022 new orleans pelicans 82.0 19755.0 3294.0 7212.0 0.457 2421.0 0.528 1501.0 ... 33.2 25.0 109.3 36.0 46.0 0.240126 NaN NaN 1 0
79 2022 new york knicks 82.0 19780.0 3088.0 7069.0 0.437 2006.0 0.497 1473.0 ... 34.6 21.9 106.5 37.0 45.0 -0.549704 NaN NaN 0 0

50 rows × 47 columns

We used a Label Encoder to make our Team and Championship Team columns to be numerical. Note that we didn't use regular One Hot Encoding here as there are way more than 2 categories for each of these columns.

In [161]:
# Initialize LabelEncoder
le = LabelEncoder()

# Fit and transform the 'TEAM' and 'CHAMPIONSHIP_TEAM' columns
encoded_final_team_df['TEAM'] = le.fit_transform(encoded_final_team_df['TEAM'])
encoded_final_team_df['CHAMPIONSHIP_TEAM'] = le.fit_transform(encoded_final_team_df['CHAMPIONSHIP_TEAM'])

# Print the first 50 rows of the dataframe
encoded_final_team_df.head(50)
Out[161]:
SEASON TEAM GAMES_x MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOALS_ATTEMPTED FIELD_GOALS_PERCENTAGE TWO_POINTERS_MADE TWO_POINTERS_PERCENTAGE FREE_THROWS_MADE ... DEFENSIVE_REBOUND_PER_GAME ASSIST_PER_GAME POINTS_PER_GAME W L Composite Score YEAR CHAMPIONSHIP_TEAM PLAYOFFS CHAMPION
30 2023 0 82.0 19855.0 3658.0 7574.0 0.483 2776.0 0.548 1513.0 ... 33.2 25.0 118.4 41.0 41.0 5.039652 NaN 24 0 0
31 2023 1 82.0 19980.0 3460.0 7278.0 0.475 2145.0 0.567 1436.0 ... 35.6 26.7 117.9 57.0 25.0 14.503575 NaN 24 0 0
32 2023 2 82.0 19730.0 3399.0 6978.0 0.487 2351.0 0.559 1449.0 ... 32.3 25.5 113.4 45.0 37.0 6.229818 NaN 24 0 0
33 2023 8 82.0 19905.0 3488.0 7116.0 0.490 2634.0 0.555 1446.0 ... 33.9 24.5 113.1 40.0 42.0 6.265528 NaN 24 0 0
34 2023 7 82.0 19830.0 3385.0 7413.0 0.457 2504.0 0.528 1447.0 ... 33.5 25.1 111.0 27.0 55.0 -5.664596 NaN 24 0 0
35 2023 9 82.0 19880.0 3408.0 6984.0 0.488 2458.0 0.559 1439.0 ... 31.4 24.9 112.3 51.0 31.0 10.240028 NaN 24 0 0
36 2023 11 82.0 19930.0 3283.0 6909.0 0.475 2037.0 0.574 1554.0 ... 31.2 22.9 114.2 38.0 44.0 3.161333 NaN 24 0 0
37 2023 12 82.0 19755.0 3574.0 7088.0 0.504 2605.0 0.575 1378.0 ... 32.9 28.9 115.8 53.0 29.0 11.046367 2023.0 4 0 1
38 2023 14 82.0 19805.0 3244.0 7140.0 0.454 2310.0 0.516 1623.0 ... 31.3 23.0 110.3 17.0 65.0 -10.766249 NaN 24 0 0
39 2023 15 82.0 19830.0 3538.0 7393.0 0.479 2175.0 0.564 1314.0 ... 34.1 29.8 118.9 44.0 38.0 8.250557 NaN 24 0 0
40 2023 17 82.0 19755.0 3329.0 7286.0 0.457 2473.0 0.530 1567.0 ... 32.9 22.4 110.7 22.0 60.0 -9.554702 NaN 24 0 0
41 2023 18 82.0 19755.0 3444.0 7345.0 0.469 2332.0 0.540 1535.0 ... 31.4 27.0 116.3 35.0 47.0 -0.748703 NaN 24 0 0
42 2023 22 82.0 19830.0 3370.0 7059.0 0.477 2329.0 0.539 1533.0 ... 33.4 23.9 113.6 44.0 38.0 5.304915 NaN 24 0 0
43 2023 23 82.0 19880.0 3516.0 7298.0 0.482 2631.0 0.555 1691.0 ... 35.7 25.3 117.2 43.0 39.0 7.100823 NaN 24 0 0
44 2023 25 82.0 19780.0 3585.0 7551.0 0.475 2600.0 0.548 1432.0 ... 34.6 26.0 116.9 51.0 31.0 9.904935 NaN 24 0 0
45 2023 30 82.0 19805.0 3215.0 6991.0 0.460 2235.0 0.540 1567.0 ... 30.9 23.8 109.5 44.0 38.0 1.289634 NaN 24 0 0
46 2023 31 82.0 19830.0 3504.0 7411.0 0.473 2287.0 0.557 1364.0 ... 37.5 25.8 116.9 58.0 24.0 11.618075 NaN 24 0 0
47 2023 33 82.0 19830.0 3515.0 7167.0 0.490 2518.0 0.568 1467.0 ... 32.8 26.2 115.8 42.0 40.0 5.278023 NaN 24 0 0
48 2023 38 82.0 19855.0 3447.0 7180.0 0.480 2548.0 0.541 1585.0 ... 33.1 25.9 114.4 42.0 40.0 6.503432 NaN 24 0 0
49 2023 40 82.0 19955.0 3444.0 7328.0 0.470 2407.0 0.547 1589.0 ... 34.0 22.9 116.0 47.0 35.0 6.553436 NaN 24 0 0
50 2023 43 82.0 19855.0 3533.0 7590.0 0.465 2538.0 0.530 1572.0 ... 32.3 24.4 117.5 40.0 42.0 3.638798 NaN 24 0 0
51 2023 44 82.0 19780.0 3323.0 7074.0 0.470 2440.0 0.539 1607.0 ... 33.1 23.2 111.4 34.0 48.0 -1.575875 NaN 24 0 0
52 2023 45 82.0 19880.0 3347.0 6870.0 0.487 2312.0 0.551 1719.0 ... 32.2 25.2 115.2 54.0 28.0 10.753173 NaN 24 0 0
53 2023 46 82.0 19780.0 3453.0 7388.0 0.467 2452.0 0.520 1412.0 ... 32.4 27.3 113.6 45.0 37.0 6.281789 NaN 24 0 0
54 2023 49 82.0 19730.0 3317.0 7001.0 0.474 2261.0 0.550 1609.0 ... 31.1 24.2 113.4 33.0 49.0 -2.474210 NaN 24 0 0
55 2023 50 82.0 19830.0 3573.0 7232.0 0.494 2445.0 0.586 1624.0 ... 32.9 27.3 120.7 48.0 34.0 9.562584 NaN 24 0 0
56 2023 51 82.0 19855.0 3533.0 7593.0 0.465 2622.0 0.529 1292.0 ... 31.9 27.2 113.0 22.0 60.0 -10.049878 NaN 24 0 0
57 2023 59 82.0 19805.0 3434.0 7489.0 0.459 2554.0 0.525 1506.0 ... 30.3 23.9 112.9 41.0 41.0 3.093160 NaN 24 0 0
58 2023 60 82.0 19805.0 3485.0 7365.0 0.473 2391.0 0.560 1536.0 ... 34.1 26.0 117.1 37.0 45.0 3.838556 NaN 24 0 0
59 2023 66 82.0 19755.0 3456.0 7126.0 0.485 2531.0 0.559 1442.0 ... 34.2 25.4 113.2 35.0 47.0 2.969104 NaN 24 0 0
60 2022 0 82.0 19705.0 3401.0 7241.0 0.470 2345.0 0.531 1485.0 ... 33.9 24.6 113.9 43.0 39.0 5.432341 NaN 24 1 0
61 2022 1 82.0 19905.0 3341.0 7167.0 0.466 2256.0 0.547 1397.0 ... 35.5 24.8 111.8 51.0 31.0 13.206362 NaN 24 1 0
62 2022 2 82.0 19755.0 3442.0 7251.0 0.475 2502.0 0.538 1434.0 ... 34.1 25.3 112.9 44.0 38.0 6.103811 NaN 24 1 0
63 2022 8 82.0 19730.0 3422.0 7127.0 0.480 2550.0 0.535 1436.0 ... 33.7 23.9 111.6 46.0 36.0 3.285980 NaN 24 1 0
64 2022 7 82.0 19880.0 3508.0 7497.0 0.468 2365.0 0.542 1298.0 ... 33.7 28.1 115.3 43.0 39.0 4.695518 NaN 24 0 0
65 2022 9 82.0 19730.0 3255.0 6940.0 0.469 2302.0 0.541 1376.0 ... 34.0 25.2 107.8 44.0 38.0 5.081598 NaN 24 0 0
66 2022 11 82.0 19755.0 3222.0 6982.0 0.461 2149.0 0.548 1341.0 ... 33.8 23.4 108.0 52.0 30.0 6.743320 NaN 24 1 0
67 2022 12 82.0 19805.0 3416.0 7079.0 0.483 2377.0 0.575 1372.0 ... 34.9 27.8 112.7 48.0 34.0 8.835371 NaN 24 1 0
68 2022 14 82.0 19780.0 3129.0 7267.0 0.431 2204.0 0.498 1413.0 ... 32.0 23.5 104.8 23.0 59.0 -11.534968 NaN 24 0 0
69 2022 15 82.0 19730.0 3323.0 7087.0 0.469 2147.0 0.557 1280.0 ... 35.7 27.1 111.0 53.0 29.0 11.684598 2022.0 6 1 1
70 2022 17 82.0 19755.0 3229.0 7083.0 0.456 2124.0 0.543 1434.0 ... 32.4 23.6 109.7 20.0 62.0 -9.883424 NaN 24 0 0
71 2022 18 82.0 19880.0 3398.0 7338.0 0.463 2401.0 0.541 1347.0 ... 32.6 25.4 111.5 25.0 57.0 -2.462554 NaN 24 0 0
72 2022 22 82.0 19780.0 3285.0 7170.0 0.458 2238.0 0.512 1273.0 ... 34.9 24.0 108.4 42.0 40.0 2.693561 NaN 24 0 0
73 2022 23 82.0 19980.0 3415.0 7279.0 0.469 2433.0 0.546 1380.0 ... 34.5 24.0 112.1 33.0 49.0 0.602110 NaN 24 0 0
74 2022 25 82.0 19780.0 3571.0 7739.0 0.461 2626.0 0.519 1393.0 ... 35.0 26.0 115.6 56.0 26.0 11.182765 NaN 24 1 0
75 2022 30 82.0 19855.0 3246.0 6954.0 0.467 2132.0 0.531 1416.0 ... 33.9 25.5 110.0 53.0 29.0 9.139626 NaN 24 1 0
76 2022 31 82.0 19755.0 3429.0 7331.0 0.468 2276.0 0.544 1459.0 ... 36.5 23.9 115.5 51.0 31.0 9.273393 NaN 24 1 0
77 2022 33 82.0 19780.0 3411.0 7458.0 0.457 2200.0 0.540 1474.0 ... 32.9 25.7 115.9 46.0 36.0 6.222438 NaN 24 1 0
78 2022 38 82.0 19755.0 3294.0 7212.0 0.457 2421.0 0.528 1501.0 ... 33.2 25.0 109.3 36.0 46.0 0.240126 NaN 24 1 0
79 2022 40 82.0 19780.0 3088.0 7069.0 0.437 2006.0 0.497 1473.0 ... 34.6 21.9 106.5 37.0 45.0 -0.549704 NaN 24 0 0

50 rows × 47 columns

Here we decide to fill in null or absent values with the mean for those values within that given year and for the specific team the value falls under. This is to avoid mixing up from other teams.

In [ ]:
cols_to_fill = encoded_final_team_df.columns.tolist()

for col in cols_to_fill:
    encoded_final_team_df[col] = encoded_final_team_df.groupby(['TEAM', 'SEASON'])[col].transform(lambda x: x.fillna(x.mean()))
encoded_final_team_df
In [ ]:
encoded_final_team_df['CHAMPION'].value_counts()[1]

We use feature bagging, such as considering the square root of the total number of features, to determine the number of features to use in Random Forests. This technique involves randomly selecting a subset of features at each split, which helps prevent overfitting and increases the diversity among the trees in the ensemble. By dropping some columns based on their low information gain score relative to other columns, we prioritize the inclusion of features that contribute the most to the predictive power of the model while efficiently managing computational resources and reducing the risk of overfitting.

In [ ]:
columns_to_drop = [
# 'SEASON',
#  'TEAM',
 'GAMES_x',
 'MINUTES_PLAYED',
#  'FIELD_GOALS_MADE',
#  'FIELD_GOALS_ATTEMPTED',
 'FIELD_GOALS_PERCENTAGE',
#  'TWO_POINTERS_MADE',
 'TWO_POINTERS_PERCENTAGE',
 'FREE_THROWS_MADE',
#  'FREE_THROWS_ATTEMPTED',
 'FREE_THROW_PERCENTAGE_x',
 'DEFENSIVE_REBOUND',
#  'TOTAL_REBOUNDS',
 'ASSISTS',
 'STEALS',
 'BLOCKS',
 'TURNOVERS',
 'POINTS',
 'L_x',
 'AGE',
 'PW',
 'PL',
#  'MARGIN_OF_VICTORY',
 'STRENGTH_OF_SCHEDULE',
#  'SIMPLE_RATING_SYSTEM',
 'OFFENSIVE_RATING',
 'DEFENSIVE_RATING',
 'NET_RATING',
 'TRUE_SHOOTING_PERCENTAGE',
 'EFFECTIEV_FIELD_GOAL_PERCENTAGE',
 'TURNOVER_PERCENTAGE',
 'OPPONENT_EFFECTIVE_FIELD_GAL',
 'OPPONENT_FREE_THROW_PERCENTAGE.1',
 'fg_per_game',
 'FIELD_GOAL_PERCENTAGE',
 'TWO_POINTER_PERCENTAGE',
 'DEFENSIVE_REBOUND_PER_GAME',
 'ASSIST_PER_GAME',
 'POINTS_PER_GAME',
 'W',
 'L',
#  'Composite Score',
 'YEAR',
 'CHAMPIONSHIP_TEAM',
 'PLAYOFFS',
#  'CHAMPION'
]

encoded_final_team_df.drop(columns_to_drop, axis=1, inplace=True)

Here we obtain the X (features) and y (target) variables necessary in our random forest classifier. Since Champion is the dependent variable we drop it from the dataframe before passing it to X. On the other hand it is the only column considered when submitting to the y variable.

In [ ]:
#X = encoded_final_team_df.drop('CHAMPION', axis=1)

X = encoded_final_team_df.drop(
 'CHAMPION', axis=1)
y = encoded_final_team_df['CHAMPION']

The dataset is then split into training and testing sets, 20% of the dataset is used as the testing set and 80% as the training set then we make 100 trees in our forest.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
In [ ]:
modelF = RandomForestClassifier(n_estimators=100, random_state=55)
modelF.fit(X_train, y_train)

The accuracy and confusion matrix are obtained.

In [ ]:
y_pred = modelF.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

A 5-fold cross validation is performed on our model, this is to provide a more reliable assessment on how our model performed then the cross-validation scores and the average cross-validation score are obtained.

In [ ]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross validation
scores = cross_val_score(modelF, X, y, cv=5)

# Print cross-validation scores
print('Cross-validation scores: ', scores)

# Print average cross-validation score
print('Average cross-validation score: ', scores.mean())

A classification report is obtained.

Let's break it down below:

  1. Precision: Precision measures how many of the positive predictions made by the classifier are actually correct. It is calculated as the ratio of true positive predictions to the total number of positive predictions. In our case, the precision for class 1 (the positive class) is 0.40, which means that only 40% of the positive predictions were accurate.

  2. Recall (Sensitivity): Recall, also known as sensitivity or true positive rate, measures how many of the actual positive instances were correctly predicted by the classifier. It is calculated as the ratio of true positive predictions to the total number of actual positive instances. In our case, the recall for class 1 is 0.18, indicating that only 18% of the actual positive instances were correctly identified.

  3. F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. It is especially useful for imbalanced datasets where precision and recall may be skewed. The F1-score is calculated as follows:

    $$ F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

    For class 1, the F1-score is 0.25, which reflects the trade-off between precision and recall.

  4. Support: The support represents the number of actual instances in each class. In your case, there are 11 instances of class 1.

In summary, our classifier performs well in terms of precision for class 0 (negative class), but its recall and F1-score for class 1 (positive class) are relatively low.

The low support (number of actual instances) can certainly impacts the recall.

When dealing with a small number of instances in the positive class (class 1), the classifier has fewer opportunities to correctly predict them. If the model misclassifies even a few instances, it significantly affects the recall because the denominator (total actual positive instances) is small. In our case, with only 11 instances of class 1, the model’s ability to capture them accurately is limited.

Imbalanced datasets (where one class has significantly fewer instances than the other) can lead to biased performance metrics. In such cases, the model may prioritize the majority class (class 0) due to its higher prevalence, resulting in lower recall for the minority class (class 1).

In [ ]:
from sklearn.metrics import classification_report

# Get predictions
y_pred = modelF.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

We create a bar plot of the feature importances as determined by your Random Forest model. Feature importance gives a score for each feature of our data, the higher the score more important or relevant is the feature towards your output variable. This visualization helps us understand which features are driving the predictions of your model, which can be very useful in understanding the underlying processes that the model is using to make predictions.

In [ ]:
importances = modelF.feature_importances_
# Get the index of importances from greatest importance to least
sorted_index = np.argsort(importances)[::-1]
x = range(len(sorted_index))

# Create tick labels
labels = np.array(X.columns)[sorted_index]
plt.bar(x, importances[sorted_index], tick_label=labels)

# Rotate tick labels to vertical
plt.xticks(rotation=90)
plt.show()

From the outset, our goal was clear and ambitious: predict the NBA champion for the upcoming year. Armed with data science tools and fueled by our passion for basketball, we embarked on this exhilarating journey.

We navigated the vast landscape of our dataset, delving into player statistics and team performance. Our journey led us through intricate details, where we wrangled and transformed raw data into a format our machine learning model could comprehend. Irrelevant noise was discarded, and significant patterns emerged from the numbers.

Now, standing at the precipice of discovery, we've completed rigorous data preprocessing, meticulous feature selection, and model training. The anticipation builds as we feed our processed data into our finely tuned model. This moment represents the culmination of our efforts—a transformation from raw data to a prediction for the future.

Without further ado, let the power of machine learning guide us.

In [ ]:
present_Advanced = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/2024 Advanced.csv")
present_Total = pd.read_csv("/content/drive/MyDrive/FINAL PROJECT CSVs/2024 Total.csv")
In [ ]:
present_Advanced.columns.tolist()
In [ ]:
present_Advanced = present_Advanced.drop(['Rk',
 'Unnamed: 18',
 'Unnamed: 23',
 'eFG%',
 'TURNOVER_PERCENTAGE.1',
 'DEFENSIVE_REBOUND_PERCENTAGE',
 'FT/FGA',
 'Unnamed: 28',
 'Arena',
 'Attend.',
 'Attend./G'],  axis = 1)
In [ ]:
present_Total.columns.tolist()
In [ ]:
present_Total = present_Total.drop('Rk', axis=1)

Our validation df is created by merging both 2024 dataframe the total dataframe and the advanced dataframe, this is done on the TEAM column.

In [ ]:
Testing_df =  present_Total.merge(present_Advanced, on='TEAM')
In [ ]:
Testing_df.columns.tolist()

Calculating the composite score for the data we are about to test

In [ ]:
numeric_cols = Testing_df.select_dtypes(include=np.number)

# Calculate correlation with a win
correlation_with_w = numeric_cols.apply(lambda x: x.corr(Testing_df['W']))


# Standardize numeric variables
normalized_variables = (numeric_cols - numeric_cols.mean()) / numeric_cols.std()


# Calculate the composite score, taking into account the direction of correlation
Testing_df['Composite Score'] = normalized_variables.multiply(correlation_with_w, axis=1).sum(axis=1)

We drop the columns that have the columns that dont match what was previously used to train our ML model above. This makes sure the features used in our random forest classification model match and we are able to pass it in for prediction.

In [ ]:
Testing_df = Testing_df.drop([#'TEAM',
 'PLAYOFFS_x',
 'GAMES',
 'MINUTES_PLAYED',
 #'FIELD_GOALS_MADE',
#  'FIELD_GOALS_ATTEMPTED',
 'FIELD_GOALS_PERCENTAGE',
 'THREE_POINTERS_MADE',
 'THREE_POINTERS_ATTEMPTED',
 'THREE_POINTERS_PERCENTAGE',
#  'TWO_POINTERS_MADE',
 'TWO_POINTERS_ATTEMPTED',
 'TWO_POINTERS_PERCENTAGE',
 'FREE_THROWS_MADE',
#  'FREE_THROWS_ATTEMPTED',
 'FREE_THROWS_PERCENTAGE',
 'OFFENSIVE_REBOUNDS',
 'DEFENSIVE_REBOUNDS',
 #'TOTAL_REBOUNDS',
 'ASSISTS',
 'STEALS',
 'BLOCKS',
 'TURNOVERS',
 'PERSONAL_FOULS',
 'POINTS',
 'PLAYOFFS_y',
 'AGE',
 'W',
 'L',
 'PW',
 'PL',
 #'MARGIN_OF_VICTORY',
 'STRENGTH_OF_SCHEDULE',
 #'SIMPLE_RATING_SYSTEM',
 'OFFENSIVE_RATING',
 'DEFENSIVE_RATING',
 'NET_RATING',
  #'Composite Score',
 'PACE',
 'FREE_THROW_ATTEMPT_RATE',
 'THREE_POINT_ATTEMPT_RATE',
 'TRUE_SHOOTING_PERCENTAGE',
 'EFFECTIVE_FIELD_GOAL_PERCENTAGE',
 'TURNOVER_PERCENTAGE',
 'OFFENSIVE_REBOUND_PERCENTAGE',
 'FREE_THROWS_PER_FIELD_GOAL_ATTEMPT'], axis = 1)

A season column is added and filled up with the value 2024 representing the year.

In [ ]:
Testing_df ['SEASON'] = 2024
In [ ]:
Testing_df
In [ ]:
encoded_final_team_df.columns.tolist()
In [ ]:
Testing_df.columns.tolist()

Here we make the SEASON column the first column so that it matches the order our features occured in the training model.

In [ ]:
cols = ['SEASON'] + [col for col in Testing_df.columns if col != 'SEASON']
Testing_df = Testing_df[cols]

Label encoding is applied to our TEAM column, this turns the values found in this column into a format that can be processed by our ML model.

In [ ]:
#le = LabelEncoder()
Testing_df['TEAM'] = le.fit_transform(Testing_df['TEAM'])
Testing_df.head(50)

Our trained model is used for the long waited prediction we pass our dataframe "testing_df" into it.

In [ ]:
encoded_final_team_df.columns.tolist()
In [ ]:
Testing_df.columns.tolist()
In [ ]:
predictions = modelF.predict(Testing_df)

We then observe that only one array index has value 1. This is a positive result. This 1 value indicates the team at that index is going to be the champion, and the others with 0 values are unfortunately not going to be able to get the long desired prize.

In [ ]:
predictions
In [ ]:
# Add the predictions as a new column to the DataFrame
Testing_df['predictions'] = predictions

Testing_df

When we filter our testing_df based on the prediction we can see that the winner is Team 1. Team 1 before we label encoded happens to be the Boston Celtics

In [ ]:
# Filter the DataFrame where 'predictions' is 'yes'
filtered_df = Testing_df[Testing_df['predictions'] == 1]

# # Get the 'TEAM' value
team_value = filtered_df['TEAM'].values[0]

# Retrieve the original team name using the mapping dictionary
original_team_name = le.inverse_transform([team_value])

# Print the original team name
print("And our predicted winner is:", original_team_name)

The Boston Celtics!!¶

VI - Visualization¶

In [ ]:
# Export the first three decision trees from the forest

import graphviz
from sklearn.tree import export_graphviz

for i in range(3):
    tree = modelF.estimators_[i]
    dot_data = export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               max_depth=2,
                               impurity=False,
                               proportion=True)
    graph = graphviz.Source(dot_data)
    graph.render(f'tree_{i}', format='png')
    display(graph)

Insights and Conclusion¶

In our recent data science project, we leveraged historical data on player performance, opponent statistics, and team attributes to predict the 2024 NBA champion. Our extensive data cleaning and processing led us to identify the most relevant features for our predictive model. Interestingly, we found that features extracted from team-related data sources provided the most predictive power.

We utilized Random Forests as our primary machine learning model due to its ability to handle large datasets with numerous features while effectively managing overfitting. This approach enabled us to capture complex relationships within the data and make robust predictions about future championship outcomes.

Our model’s prediction pointed towards the Boston Celtics as the potential victors of the NBA 2024 championship. This insight underscores the effectiveness of our approach in analyzing historical trends and team performance metrics to identify strong contenders for the title.

In conclusion, our project yielded valuable insights into the factors influencing team success in professional basketball. Our focus on relevant features and the use of Random Forests as our predictive model allowed us to provide actionable insights for stakeholders and enthusiasts alike. While our model’s forecast favors the Boston Celtics, we acknowledge the dynamic nature of sports and the potential for unexpected outcomes. Nonetheless, our methodology demonstrates the potential of data-driven approaches in enhancing decision-making processes and gaining deeper insights into the complex dynamics of competitive sports.

This project was a great learning experience, helping us develop skills in data cleaning and processing. The prediction of the Boston Celtics as potential winners reaffirms our belief in the effectiveness of our work as they are currently up 1-0 in the NBA playoffs semi-final. We unearthed valuable insights and presented visualizations that could spur further inquiries and investigations, potentially leading to positive outcomes for basketball, such as improved popularity, performance, scouting, and player development.

While we wouldn’t advise using our prediction as a sure bet due to the preliminary nature of our model and the observed imprecision in the values, we hope that the outcome of this year’s NBA season reflects the results of our report. We believe the low scores for the positive outcome in our model are due to our data being unbalanced. For future research, we recommend considering other ensemble methods adapted for imbalanced data, such as Synthetic Minority Over-sampling Technique (SMOTE).

Looking ahead, we are hopeful that our work will contribute to the broader understanding of basketball dynamics and inspire further research in this exciting field.

In [ ]:
!pip install nbconvert
In [ ]:
%cd '/content/drive/My Drive/Colab Notebooks'
In [ ]:
!jupyter nbconvert --to html Final-Project.ipynb